Hello!
I am running into a problem with using collapsed taxonomy tables to train machine learning models (I'm training the models outside QIIME 2, not using the sample classifier plugin). I have trained models at each taxonomic level (i.e., genus, species, and ASV level) to see which is most effective for generating predictions. I constructed the collapsed tables using the qiime taxa collapse command. In this analysis, the species level has proven the most accurate.
However, I am concerned that the collapsed table doesn't represent actual "species" in the data. We sequenced the V4 region of the 16S rRNA gene with EMP primers, so I am aware that I shouldn't expect species-level resolution for all of my sequences. As I understand it, the taxa collapse function groups sequences into the same "species" by matching the names they were given during taxonomy assignment. This makes sense, except that the naming scheme isn't always consistent. For example, my taxonomy includes k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;__;__;__ and k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__;g__;s__ as separate species, even though both are resolved only to the order level and seem like they should be grouped together. My problem is similar to the one posted here (Multiple entries after collapsing using taxa barplot), where it appears that a label with the rank letter is an unannotated species and a label with only underscores was not resolved to the species level. Even so, I don't understand why these features are left in the table if they aren't truly at the species level.
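To make the issue concrete, here is a minimal sketch in plain Python of how I could merge the two label forms myself after collapsing. This is not how QIIME 2 works internally; the function names and the counts are hypothetical, and I'm assuming unassigned ranks appear either as bare `__` placeholders or as empty prefixes like `s__`.

```python
from collections import defaultdict


def trim_unresolved(label):
    """Drop trailing ranks that carry no name (e.g. 's__', 'g__', '__', or empty)."""
    ranks = [r.strip() for r in label.split(";")]
    # A rank counts as unresolved when nothing follows its 'x__' prefix.
    while ranks and ranks[-1].split("__")[-1] == "":
        ranks.pop()
    return ";".join(ranks)


def merge_by_trimmed_label(counts):
    """Sum the counts of labels that are identical once unresolved ranks are trimmed."""
    merged = defaultdict(int)
    for label, count in counts.items():
        merged[trim_unresolved(label)] += count
    return dict(merged)


# Hypothetical counts for one sample (made-up numbers, for illustration only).
counts = {
    "k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;__;__;__": 12,
    "k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__;g__;s__": 30,
}

merged = merge_by_trimmed_label(counts)
for label, count in merged.items():
    print(label, count)
```

With this approach both entries end up under a single order-level label, but I don't know whether treating the two forms as equivalent is justified.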
I can think of two ways to potentially resolve this, but I am unsure which is more scientifically sound and whether they can be done in QIIME 2, so I'm hoping I can get some insight on that here:
- Is there a way to remove everything that didn't resolve to the species level from the dataset? I feel like this would bias the dataset, but it may be more accurate if I'm reporting this as a true species-level taxonomy.
- Does QIIME 2 have functionality similar to the phyloseq tip_glom function (tip_glom: Agglomerate closely-related taxa using single-linkage... in phyloseq: Handling and analysis of high-throughput microbiome census data) that would allow me to collapse the taxonomy based on a tree instead of the names?
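For the first option, this is roughly what I have in mind, sketched in plain Python outside QIIME 2. The function names and the example counts are hypothetical, and I'm assuming seven-rank Greengenes-style labels where a resolved species looks like `s__name`. It also reports the fraction of reads that survive the filter, since that is where I worry about bias.

```python
def is_species_resolved(label, species_index=6):
    """Return True if the label has a non-empty 's__' entry at the species rank."""
    ranks = [r.strip() for r in label.split(";")]
    if len(ranks) <= species_index:
        return False
    species = ranks[species_index]
    return species.startswith("s__") and len(species) > len("s__")


def filter_species_resolved(sample_counts):
    """Keep only species-resolved features and report the fraction of reads retained."""
    kept = {
        label: count
        for label, count in sample_counts.items()
        if is_species_resolved(label)
    }
    total = sum(sample_counts.values())
    retained = sum(kept.values()) / total if total else 0.0
    return kept, retained


# Hypothetical counts for one sample (made-up numbers, for illustration only).
sample = {
    "k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Prevotellaceae;g__Prevotella;s__copri": 60,
    "k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__;g__;s__": 30,
    "k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;__;__;__": 10,
}

kept, retained = filter_species_resolved(sample)
print(f"{len(kept)} of {len(sample)} features kept, {retained:.0%} of reads retained")
```

If a large share of reads is dropped in most samples, I suspect the filtered table would no longer be comparable to the genus-level and ASV-level tables.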
Thank you so much in advance for any help you can provide!
Aeriel