Collapse taxonomy by tree tips instead of name


I am running into a problem with using a collapsed taxonomy to train models for machine learning (I’m training models outside qiime, not using the sample classifier plugin). I have trained models using each taxonomy level (i.e. genus vs species vs ASV level) to see which is most effective for generating predictions. I constructed these collapsed tables using the qiime taxa collapse command. When I do this analysis, the species level has proven the most accurate.

However, I am concerned that the collapsed data table isn’t representing actual “species” in the data. We sequenced the V4 region of the 16S rRNA gene with EMP primers, so I am aware that I shouldn’t expect species-level resolution for all of my sequences. But, based on my understanding, the taxa collapse function groups sequences into the same “species” by matching the names they were given during taxonomy assignment. This makes sense, except that the naming scheme isn’t always consistent. For example, in my taxonomy I have k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;__;__;__ and k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__;g__;s__ included as separate species, though based on the level of resolution they reached it seems like they should be grouped together. My problem is similar to what’s posted here (Multiple entries after collapsing using taxa barplot), where it appears that a level with a letter prefix (e.g. s__) is an unannotated species, while a level with only underscores was not resolved to species. I am still confused as to why the unresolved entries are left in the table if they aren’t truly at the species level.

I can think of two methods to potentially resolve this, but I am unsure which is more scientifically correct and whether they can be done in QIIME 2, so I’m hoping I can get some insight on that here.

  1. Is there a way to remove everything that didn’t resolve to a species level from the dataset? I feel like this is biasing the dataset, but may be more accurate if I’m reporting that this is a true species-level taxonomy.
  2. Does QIIME have a functionality similar to the phyloseq tip_glom function that would allow me to collapse the taxonomy based on a tree instead of the names?

Thank you so much in advance for any help you can provide!


Hi @aeriel.belk,
This problem is a good reason to keep things as ASVs in most cases (as they are usually more informative), though as you say, you have found that species are better predictors here.

All in all, I think your idea #2 is probably the better way to go about this, and I have given a QIIME 2 solution below.

The difference between those two labels is a quirk of Greengenes naming conventions, and they really do represent different things, so they should not be collapsed into one.

The bigger issue is that all ASVs classified only to order level will be lumped together, e.g., as k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;__;__;__, even though they may represent very diverse lineages (and would classify as such if only you had a little more information…).

QIIME 2 is actually classifying this one to species level: k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__;g__;s__. That is how the closest matches are annotated in Greengenes: these are OTUs that represent diverse lineages but have very similar 16S sequences. So it is probably distinct enough from the order-level ASVs that it should be kept separate.

The f__;g__;s__ entry is classified at species level (but the species annotation is missing). The entry ending in bare underscores cannot be classified at that level, and it is up to you to remove it if that is really what you want to do.
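To make the distinction concrete, here is a minimal Python sketch of name-based collapsing. It is only an imitation of the behaviour described in this thread, not the actual qiime taxa collapse code: the function names, feature IDs, and counts are made up for illustration, and it assumes that ranks missing from a classification are padded with a bare `__`.

```python
from collections import defaultdict


def collapse_label(taxonomy, level=7):
    """Truncate a taxonomy string to `level` ranks, padding missing ranks with '__'.

    Assumption: this mirrors the padding seen in the collapsed labels above.
    """
    ranks = [rank.strip() for rank in taxonomy.split(";")]
    ranks = ranks[:level]
    ranks += ["__"] * (level - len(ranks))
    return ";".join(ranks)


def collapse_counts(counts, taxonomy, level=7):
    """Sum feature counts that share the same collapsed label (pure string match)."""
    collapsed = defaultdict(int)
    for feature_id, count in counts.items():
        collapsed[collapse_label(taxonomy[feature_id], level)] += count
    return dict(collapsed)


# Hypothetical ASVs: two stop at order level, one matches an unannotated species.
order_only = "k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales"
taxonomy = {
    "asv1": order_only,
    "asv2": order_only,
    "asv3": order_only + "; f__; g__; s__",
}
counts = {"asv1": 10, "asv2": 5, "asv3": 7}

collapsed = collapse_counts(counts, taxonomy)
for label, total in collapsed.items():
    print(total, label)
```

The two order-level ASVs end up in one row even if they are unrelated, while the f__;g__;s__ ASV stays in its own row, because the grouping is a string comparison and nothing more.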

For your idea #1: sure, qiime taxa filter-table will do this for you. See the example in the filtering tutorial (it filters at phylum level; just include s__ instead of p__). This filters out the underclassified features, but not those that are classified to species level yet unannotated in Greengenes (e.g., k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__;g__;s__). If you want to remove those too, run a second filter step with the same command to exclude g__;s__.

It is worth a try, but I am not sure it will really accomplish what you want. It probably wouldn’t bias the results, since you are not adding information; you are removing it! So if worst comes to worst, I think you will just hurt classification accuracy because you are losing features. Those features may not be important anyway, since they presumably represent diverse lineages (as it sounds like you are assuming), so it is worth a try…
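The two-step filter described above (keep anything containing s__, then drop anything containing g__;s__) can be sketched in plain Python. This is only an illustration of the substring logic, not the filter-table implementation: `keep_feature` and the example strings are hypothetical. One assumption worth flagging: the sketch strips the spaces after semicolons before matching, so if your taxonomy strings look like "g__; s__", check whether your exclude pattern needs that space too.

```python
def normalize(taxonomy):
    """Remove whitespace around ';' so 'g__; s__' and 'g__;s__' compare equal."""
    return ";".join(rank.strip() for rank in taxonomy.split(";"))


def keep_feature(taxonomy, include="s__", exclude=None):
    """Case-insensitive substring filter: must contain `include`, must not contain `exclude`."""
    label = normalize(taxonomy).lower()
    if include.lower() not in label:
        return False
    if exclude is not None and normalize(exclude).lower() in label:
        return False
    return True


# Hypothetical taxonomy annotations (before any collapsing).
order_only = "k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales"
unannotated = order_only + "; f__; g__; s__"
annotated = ("k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
             "f__Lactobacillaceae; g__Lactobacillus; s__reuteri")

for name, tax in [("order_only", order_only),
                  ("unannotated", unannotated),
                  ("annotated", annotated)]:
    step1 = keep_feature(tax, include="s__")
    step2 = keep_feature(tax, include="s__", exclude="g__;s__")
    print(name, "| after include:", step1, "| after include + exclude:", step2)
```

Note that a feature with an annotated genus but a blank species (ending in something like g__Lactobacillus; s__) passes both steps in this sketch, so decide separately whether you want to keep those.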

For your idea #2: no, but to accomplish more or less the same thing you could just cluster your ASVs into OTUs in QIIME 2!
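To show what clustering buys you compared with name matching, here is a toy Python sketch of greedy, abundance-sorted clustering. It is not what QIIME 2 or vsearch does internally: identity here is a simple per-position match on equal-length sequences, and the sequences, counts, and 0.85 threshold are invented for the example.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, counts, threshold=0.97):
    """Assign each ASV to the first centroid it matches at >= threshold.

    ASVs are visited from most to least abundant; an ASV that matches no
    existing centroid becomes a new centroid (a new OTU).
    """
    order = sorted(seqs, key=lambda fid: (-counts[fid], fid))
    centroids = []
    membership = {}
    otu_counts = {}
    for fid in order:
        for centroid in centroids:
            if identity(seqs[fid], seqs[centroid]) >= threshold:
                membership[fid] = centroid
                otu_counts[centroid] += counts[fid]
                break
        else:
            centroids.append(fid)
            membership[fid] = fid
            otu_counts[fid] = counts[fid]
    return membership, otu_counts


# Hypothetical 10-bp "ASVs": asv2 differs from asv1 at one position, asv3 at three.
seqs = {"asv1": "ACGTACGTAC", "asv2": "ACGTACGTAT", "asv3": "TTTTACGTAC"}
counts = {"asv1": 10, "asv2": 5, "asv3": 7}

membership, otu_counts = greedy_cluster(seqs, counts, threshold=0.85)
print(membership)
print(otu_counts)
```

Here features are merged because their sequences are similar, regardless of what name the classifier gave them, which is the same spirit as collapsing by tree tips.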


