Hello everyone!
I'm analysing some 16S data from dogs' gut microbiota. The data were obtained by sequencing V3-V4 regions. I used the q2-feature-classifier classify-sklearn to annotate the taxonomy (using the latest 2024.09 GreenGenes2 release as a reference database).
I want to collapse my ASV table to a taxonomic level (e.g., genus) for some downstream analysis.
My question is: what should I do with the ASVs that lack a taxonomic annotation at the level I want to collapse to?
For example, I want to collapse to the genus level, but some ASVs are annotated only to the family level:
d__Bacteria; p__Bacillota_A_368345; c__Clostridia_258483; o__Oscillospirales; f__Ruminococcaceae
Should I eliminate those ASVs (and lose a considerable portion of my data), or is it reasonable to assign them generic genus-level annotations, like:
d__Bacteria; p__Bacillota_A_368345; c__Clostridia_258483; o__Oscillospirales; f__Ruminococcaceae_1
d__Bacteria; p__Bacillota_A_368345; c__Clostridia_258483; o__Oscillospirales; f__Ruminococcaceae_2
My problem with this solution is that the two ASVs may belong to the same genus and thus considering them as separate may not be appropriate.
Does anyone have some suggestions? Thanks!
Great question. These ASVs are not annotated all the way down because their sequences are not distinct enough to match a single species; in this case they are similar to multiple genera in the same family. Nothing is wrong with such an ASV, so you should not remove it. It simply lacks sufficient signal to be mapped to an individual genus or species.
This is a common issue with 16S amplicons sequenced with Illumina/short-read sequencers.
As for eliminating them: no! These are perfectly good sequences; they just map to more than one genus (due to the limited length and heterogeneity of the 16S region that was sequenced).
Assigning generic labels also has some other issues. The biggest is that the nomenclature you've used in that example implies (to me) that these are distinct families, but that is not the case. Those ASVs map to 2 or more genera in the family, so they belong in that family, but the genus and species IDs are unknown. You could consider something like this:
d__Bacteria; p__Bacillota_A_368345; c__Clostridia_258483; o__Oscillospirales; f__Ruminococcaceae;g__unknown_1
d__Bacteria; p__Bacillota_A_368345; c__Clostridia_258483; o__Oscillospirales; f__Ruminococcaceae;g__unknown_2
But then this has a few more issues:
You are making quite a few modifications to the taxonomic nomenclature.
You are assuming that these ASVs actually represent different "unknown genera". It is very possible that they are actually two different ASVs from the same genus (two different species or strains of the same lineage, or maybe even copy variants of the 16S gene from the same cell!). That is a problematic assumption that you cannot resolve here.
You will have many such unknowns, so I don't think that collapsing these would really help visualization or understanding of the community (but that's more a matter of personal taste, I think).
So I would recommend:
When you really want to keep the granular information that keeps these ASVs separate, perform your analysis at the ASV level. E.g., perform diversity analyses with ASVs instead of collapsed taxonomies.
When you want to collapse taxonomy for visualization, let go of that granular information. This is the point of collapsing: to reduce complexity. So in those cases when you really want to collapse, just accept that the best you can do is (in this example) collapse to:
d__Bacteria; p__Bacillota_A_368345; c__Clostridia_258483; o__Oscillospirales; f__Ruminococcaceae;__;__
Otherwise it's not really collapsing, is it?
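To make that last recommendation concrete, here is a minimal sketch in plain Python of what collapsing with "__" padding amounts to. It is only an illustration under stated assumptions, not the QIIME 2 implementation: the function names (collapse_label, collapse_table), the ASV and sample IDs, the counts, and the genus name g__GenusX are all made up for the example. As far as I know, running `qiime taxa collapse` with `--p-level 6` gives you the equivalent result on real artifacts.

```python
from collections import defaultdict


def collapse_label(taxonomy: str, level: int, filler: str = "__") -> str:
    """Truncate or pad a semicolon-delimited taxonomy string to `level` ranks."""
    ranks = [r.strip() for r in taxonomy.split(";") if r.strip()]
    ranks = ranks[:level]                       # drop ranks deeper than `level`
    ranks += [filler] * (level - len(ranks))    # pad missing ranks with "__"
    return ";".join(ranks)


def collapse_table(counts, taxonomy, level):
    """Sum per-sample counts of all ASVs that share the same collapsed label."""
    collapsed = defaultdict(lambda: defaultdict(int))
    for asv, per_sample in counts.items():
        label = collapse_label(taxonomy[asv], level)
        for sample, n in per_sample.items():
            collapsed[label][sample] += n
    return {label: dict(samples) for label, samples in collapsed.items()}


# Hypothetical toy data: asv1 and asv2 stop at the family level,
# asv3 carries a (made-up) genus annotation.
FAMILY = ("d__Bacteria; p__Bacillota_A_368345; c__Clostridia_258483; "
          "o__Oscillospirales; f__Ruminococcaceae")
taxonomy = {
    "asv1": FAMILY,
    "asv2": FAMILY,
    "asv3": FAMILY + "; g__GenusX",
}
counts = {
    "asv1": {"dog1": 10, "dog2": 0},
    "asv2": {"dog1": 5, "dog2": 7},
    "asv3": {"dog1": 3, "dog2": 4},
}

# Level 6 = genus when counting ranks from the domain (d__) downwards.
genus_table = collapse_table(counts, taxonomy, level=6)
for label, per_sample in genus_table.items():
    print(label, per_sample)
```

In this toy table, asv1 and asv2 end up summed in a single row whose label ends in f__Ruminococcaceae;__, while asv3 keeps its own genus row. Nothing is discarded, and no claim is made about how many unknown genera are present.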