Help explain taxa name format / assignment - taxonomy table

Hello,
I am new to using qiime2. I need help interpreting taxa names in the taxonomy table. Specifically, sometimes the species level is just s__, sometimes the assignment goes only to the family level, and sometimes genus and species are both empty (g__;s__). I am not clear on what all these combinations mean, and I could not find documentation describing which mapping scenarios generate these results. Examples below. I need a clear understanding of what these mean so that I can properly parse/filter these taxonomy tables. Any advice or pointers in the right direction would be much appreciated.

[1] “k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae; g__Escherichia; s__coli”
[2] “k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__”
[12] “k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae”
[35] “k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__; s__”
[89] “k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae”
[97] “k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__”

For example, does g__;s__ mean that the species and genus could not be assigned but the sequence forms a cluster at the species level (an unknown species)? If the assignment goes only to the family level, is that abundance collectively for the whole family?

Any help much appreciated! Cheers.

Hi @ondrej,
Great question. See this post (and the post linked from there) for discussion of the greengenes taxonomy formats (this is a quirk of the reference database, not a QIIME 2-specific format).
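For the parsing side of the question, here is a minimal Python sketch (plain Python, not a QIIME 2 API; `RANK_PREFIXES` and `deepest_rank` are names made up for illustration) that reports the deepest rank carrying an actual name in a Greengenes-style string. It treats an empty label such as `g__` and a missing label the same way. My understanding is that empty labels generally reflect the reference annotation itself, while a truncated string means the classifier stopped at that level, but please check that against the linked post. The sketch also assumes the ranks appear in kingdom-to-species order, as in your examples.

```python
# Hypothetical helper, not part of QIIME 2: Greengenes rank prefixes and their names.
RANK_PREFIXES = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def deepest_rank(taxon):
    """Return the deepest rank that has a non-empty name, or None if there is none."""
    deepest = None
    for field in taxon.split(";"):
        prefix, sep, name = field.strip().partition("__")
        # Ignore fields without a known prefix or with an empty name (e.g. "g__").
        if sep and prefix in RANK_PREFIXES and name.strip():
            deepest = RANK_PREFIXES[prefix]
    return deepest


# Taxonomy strings taken from the examples in the question.
examples = [
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae; g__Escherichia; s__coli",
    "k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__",
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae",
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__; s__",
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__",
]

for taxon in examples:
    print(deepest_rank(taxon), "<-", taxon)
```

With a helper like this you can, for instance, keep only features whose deepest named rank is genus or species, whichever way the shallower annotation came about.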

Let us know if you have any additional questions!

Thanks for the link. Helped quite a bit and explains the unusual annotation I mentioned earlier. One part of the question that I could not find answer to is that I find

d8ce5219469f0527953255eb3cd81283 k__Bacteria 0.9999997385854641
99419875b54d71847e68372e40da119a k__Bacteria 0.9995299802494267
e3a18745d92d8540a29d6b9a5a75ab99 k__Bacteria 0.9999945921724368

Hi, thanks for the link to a very helpful post. It clears up many of the annotation questions I had. Perhaps one last bit that is not clear: I find multiple entries for k__Bacteria (for example) in the taxonomy.tsv file. There is nothing else appended like I was asking about previously, no g__;s__ or ; . Simply multiple entries for the same taxon, but with different Feature IDs. Do these mean different unassigned clusters of bacteria? Is the only way to filter based on Feature ID to collapse taxa in qiime and go off of that?

Feature ID Taxon Confidence
d8ce5219469f0527953255eb3cd81283 k__Bacteria 0.9999997385854641
99419875b54d71847e68372e40da119a k__Bacteria 0.9995299802494267
e3a18745d92d8540a29d6b9a5a75ab99 k__Bacteria 0.9999945921724368
4e074c0394d77c26dd30bab7b1d37dfb k__Bacteria 0.9497216382927974

When you perform taxonomy classification, you are classifying each sequence independently. These sequences can frequently have the same taxonomic classification: a unique sequence or OTU does not in any way imply that it is taxonomically distinct from the others. This is especially true for sequences that cannot be classified beyond kingdom level, but it will also frequently occur at species level.

So what you are seeing is multiple sequences that cannot be confidently classified beyond kingdom level.
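As a rough illustration of that point, the sketch below groups Feature IDs by their taxon string, using only the four rows you posted. It is plain Python on the exported TSV text, not a QIIME 2 command, and `features_by_taxon` is a hypothetical helper name; it assumes the file is tab-separated with the `Feature ID`, `Taxon`, and `Confidence` headers shown in your post.

```python
import csv
import io
from collections import defaultdict

# The rows from the post, written out as tab-separated text.
TAXONOMY_TSV = "\n".join([
    "Feature ID\tTaxon\tConfidence",
    "d8ce5219469f0527953255eb3cd81283\tk__Bacteria\t0.9999997385854641",
    "99419875b54d71847e68372e40da119a\tk__Bacteria\t0.9995299802494267",
    "e3a18745d92d8540a29d6b9a5a75ab99\tk__Bacteria\t0.9999945921724368",
    "4e074c0394d77c26dd30bab7b1d37dfb\tk__Bacteria\t0.9497216382927974",
]) + "\n"


def features_by_taxon(handle):
    """Map each taxon string to the list of Feature IDs classified as that taxon."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row["Taxon"]].append(row["Feature ID"])
    return dict(groups)


groups = features_by_taxon(io.StringIO(TAXONOMY_TSV))
for taxon, feature_ids in groups.items():
    # Several distinct sequences can share one classification.
    print(taxon, len(feature_ids))
```

Each Feature ID is still a separate sequence with its own abundance in the feature table; they simply share a label, which is why collapsing by taxonomy sums them together.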

If you want to filter out specific taxa, you can do so as described in this tutorial. That would be a wise thing to do, and I would certainly remove anything that cannot be classified to at least phylum level (these are usually non-target DNA), unless you suspect a classification error (e.g., if you have many unclassified sequences, something probably went wrong during classification).
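If you end up doing this kind of check outside QIIME 2, on an exported taxonomy.tsv, a minimal Python sketch of the "at least phylum level" rule could look like the following. It assumes Greengenes-style `p__` prefixes; `has_phylum` is a hypothetical helper, and the feature IDs are placeholders made up for the example.

```python
import re

# Matches "p__" followed by at least one character of an actual phylum name,
# so an empty "p__" label does not count as a phylum-level classification.
PHYLUM_PATTERN = re.compile(r"p__[^;\s]+")


def has_phylum(taxon):
    """Return True if the taxonomy string names a phylum (Greengenes-style prefixes assumed)."""
    return bool(PHYLUM_PATTERN.search(taxon))


# Placeholder feature IDs paired with taxonomy strings like those in this thread.
rows = [
    ("feature_a", "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__"),
    ("feature_b", "k__Bacteria"),
    ("feature_c", "k__Bacteria; p__; c__; o__; f__; g__; s__"),
    ("feature_d", "Unassigned"),
]

kept = [feature_id for feature_id, taxon in rows if has_phylum(taxon)]
dropped = [feature_id for feature_id, taxon in rows if not has_phylum(taxon)]
print("kept:", kept)
print("dropped:", dropped)
```

The list of kept IDs can then be used to subset whatever table you are working with. Before dropping anything, it is worth counting how many features fall into the dropped group, since a large number may point to a classification problem rather than non-target DNA.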

