Hello,
I am new to using QIIME 2 and need help interpreting the taxon names in the taxonomy table. Specifically, sometimes the species level is just s__, sometimes the assignment goes only to family level, and sometimes genus and species are both empty (g__;s__). I am not clear on what all these combinations mean, and I could not find documentation describing which mapping scenarios generate these results. Examples below. I need a clear understanding of what these mean so that I can properly parse and filter these taxonomy tables. Any advice or pointers in the right direction would be much appreciated.
For example, does g__;s__ mean that the genus and species could not be assigned, but the sequence still forms a cluster at species level (of an unknown species)? If the assignment goes only to family level, is that abundance collectively for the whole family?
Hi @ondrej,
Great question. See this post (and the post linked from there) for a discussion of the Greengenes taxonomy format. This is a quirk of the reference database, not a QIIME 2-specific format.
Hi, thanks for the link to a very helpful post. It clears up many of the annotation questions I had. Perhaps one last bit that is not clear: I find multiple entries for k__Bacteria (for example) in the taxonomy.tsv file. Nothing else is appended as in my earlier question, no g__;s__ or trailing ;. There are simply multiple entries for the same taxon, each with a different Feature ID. Do these represent different unassigned clusters of bacteria? Is the only way to filter based on Feature ID to collapse taxa in QIIME 2 and work from that?
When you perform taxonomy classification, you are classifying each sequence independently. These sequences can frequently have the same taxonomic classification; a unique sequence or OTU does not in any way imply that it is taxonomically distinct from the others. This is especially true for sequences that cannot be classified beyond kingdom level, but it will also frequently occur at species level.
So what you are seeing is multiple sequences that cannot be confidently classified beyond kingdom level.
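To make that concrete, here is a minimal Python sketch that groups Feature IDs by their taxon string. The Feature IDs and rows are made up for illustration (they are not from your taxonomy.tsv), and `group_by_taxon` is a hypothetical helper, not part of QIIME 2.

```python
from collections import defaultdict

# Hypothetical (Feature ID, Taxon) pairs; the IDs are invented for illustration.
rows = [
    ("feat_a", "k__Bacteria"),
    ("feat_b", "k__Bacteria"),
    ("feat_c", "k__Bacteria; p__Firmicutes; c__Bacilli"),
]


def group_by_taxon(rows):
    """Map each taxon string to the list of Feature IDs classified as it."""
    groups = defaultdict(list)
    for feature_id, taxon in rows:
        groups[taxon].append(feature_id)
    return dict(groups)


groups = group_by_taxon(rows)
for taxon, feature_ids in groups.items():
    print(f"{taxon}: {len(feature_ids)} feature(s) -> {', '.join(feature_ids)}")
```

Each Feature ID is still a distinct sequence; the grouping only shows that several of them ended up with the same classification.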
If you want to filter out specific taxa, you can do so as described in this tutorial. That would be a wise thing to do, and I would certainly remove anything that cannot be classified to at least phylum level (these are usually non-target DNA), unless you suspect a classification error (e.g., if you have many unclassified sequences, something probably went wrong during classification).
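If you would rather parse the taxonomy strings yourself, the following is a rough Python sketch of the "at least phylum level" check. It assumes Greengenes-style rank prefixes (k__, p__, c__, o__, f__, g__, s__) separated by semicolons, and it treats an empty level such as g__ or s__ as "no name assigned at that rank", which is my reading of the format rather than an official definition. `deepest_rank` and `has_phylum` are hypothetical helper names, and the example strings are illustrative only.

```python
# Assumed Greengenes-style prefix for each rank.
PREFIX_TO_RANK = {
    "k": "kingdom",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def deepest_rank(taxon):
    """Return the deepest rank that has a non-empty name, or None."""
    deepest = None
    for level in taxon.split(";"):
        prefix, sep, name = level.strip().partition("__")
        # Skip levels without a '__' separator or with an empty name (e.g. 'g__').
        if sep and name and prefix in PREFIX_TO_RANK:
            deepest = PREFIX_TO_RANK[prefix]
    return deepest


def has_phylum(taxon):
    """True if the taxon string is classified to phylum level or deeper."""
    return deepest_rank(taxon) not in (None, "kingdom")


examples = [
    "k__Bacteria",
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__; s__",
    "Unassigned",
]
for taxon in examples:
    print(f"{deepest_rank(taxon)!s:8} keep={has_phylum(taxon)}  {taxon}")
```

The Feature IDs whose taxon passes `has_phylum` could then be used as the set to keep; the filtering described in the tutorial achieves the same goal inside QIIME 2.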