It gets inconvenient when I want to count the number of taxa. I have to write a custom script that recognizes these "duplicates" and then merges them by summing their abundances. Just wondering if anyone has done this before and can share the code. Otherwise I will post one.
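For reference, here is a rough sketch of the kind of merge script I have in mind. It assumes the collapsed table has been exported to a plain mapping of taxonomy label to per-sample counts; the function names (`strip_empty_ranks`, `merge_rows`) are my own and not part of QIIME. It treats a trailing `__` or an empty rank such as `g__` as "no annotation", which is only appropriate if those rows really should be considered the same taxon.

```python
import re

# An "empty" rank is either bare padding ("__") or a rank prefix with no
# name after it (e.g. "g__", "s__"). This is an assumption about the labels.
EMPTY_RANK = re.compile(r"^[a-z]?__$")


def strip_empty_ranks(label):
    """Drop trailing empty ranks from a semicolon-delimited taxonomy label."""
    ranks = [rank.strip() for rank in label.split(";")]
    while ranks and EMPTY_RANK.match(ranks[-1]):
        ranks.pop()
    return ";".join(ranks)


def merge_rows(table):
    """Sum the per-sample counts of rows whose labels match after stripping.

    `table` maps a taxonomy label to a list of counts, one per sample.
    """
    merged = {}
    for label, counts in table.items():
        key = strip_empty_ranks(label)
        if key in merged:
            merged[key] = [a + b for a, b in zip(merged[key], counts)]
        else:
            merged[key] = list(counts)
    return merged
```

Rows with a real genus or species name are left alone; only rows that differ in how their empty trailing ranks are written end up under the same key.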
Edit: Even better, will the Greengenes files for QIIME be redone to solve this upstream?
Edit 2: Sorry, I didn't read the answer carefully in the other post. The explanation from @ebolyen is:
"The one without the g / s was only assigned to k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Veillonellaceae but that isn’t 7 levels deep, so it qiime taxa collapse padded it out with __. The one that ended with g / s was assigned to an OTU that didn’t have a genus/species specific annotation in Greengenes."
So does that mean they should be treated as the same thing in a collapsed taxa table? If yes, then it makes sense to merge the two rows.
No. The distinction is that the first row (ending in __;__) cannot be confidently classified beyond family level (probably because a close match does not exist in the reference database). So sequences receiving that classification can be any taxon in f__Geodermatophilaceae. The second row (ending in g__;s__) DOES have a close match in the reference database and hence is confidently classified at species level — unfortunately, that close match does not have genus or species-level annotations. This does not in any way imply that these two different taxonomic affiliations are related beyond the family level, so it would probably be inappropriate (or at least presumptuous) to collapse these at species level.
I would recommend considering these to be unique taxa unless you have other evidence that they should be collapsed. (For example, reclassify these features with another classifier like classify-consensus-blast and/or with the SILVA database to see if these give a better idea of what these features may represent, with the caveat that a more satisfying answer may not necessarily be the "correct" answer.)
An easier way to do this for your use case (counting the number of unique taxa) would probably be to filter on taxonomy in QIIME2. Passing something like ";__" to the exclude parameter might accomplish what you are describing, though I have not tested this so cannot be sure.
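Outside of QIIME2, the same idea can be sketched in a few lines of Python, assuming you already have the row labels of the collapsed table as a list of strings. The function name `count_taxa` is made up for illustration, and like the suggestion above this is a sketch rather than something tested against real data.

```python
def count_taxa(labels, exclude=None):
    """Count distinct taxonomy labels, optionally skipping any label
    that contains the `exclude` substring (e.g. ";__")."""
    unique = set(labels)
    if exclude is not None:
        unique = {label for label in unique if exclude not in label}
    return len(unique)
```

Note that a label ending in `g__;s__` does not contain the substring `;__`, so it is still counted; only the rows padded with bare `__` are dropped.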
Good luck! I hope that provides some more clarity.
Thank you so much for your reply. Yes things have clarified for me.
The qiime taxa filter-table command using --p-exclude ";__" finished successfully, but when I checked the output, those rows had not been filtered out. I’ll file a bug report about this.
I was able to run the filter command on the original table; I got an error on the collapsed table because there the taxonomy is recognized only as annotations to the long string ID.
But now that I understand that those two are not to be treated as one taxon, I don’t need to merge them. Initially I was worried that I was double-counting something.
Oh right, this won't work because the empty annotations are only added after an action like collapsing the table or generating a barplot (I think), so this is not a bug and there's no reason to file a report.
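To illustrate why the filter finds nothing on the original table, here is a small Python sketch of the padding behavior as described in this thread. It is not QIIME 2's actual implementation; `pad_taxonomy` and its parameters are hypothetical names, and the 7-level depth comes from the explanation quoted above.

```python
def pad_taxonomy(taxonomy, depth=7, filler="__"):
    """Pad a semicolon-delimited taxonomy string out to `depth` levels.

    Strings that already have `depth` or more levels are returned
    without extra padding.
    """
    ranks = [rank.strip() for rank in taxonomy.split(";")]
    ranks += [filler] * (depth - len(ranks))
    return ";".join(ranks)


# A family-level assignment as it would appear before collapsing.
raw = "k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Veillonellaceae"
padded = pad_taxonomy(raw)

print(";__" in raw)     # the unpadded string has nothing to match
print(";__" in padded)  # the match only exists after padding
```

The unpadded annotation never contains `;__`, so an exclude filter on that pattern has nothing to remove until the padding has been added.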