After processing my bacterial 16S rRNA gene sequences in QIIME 2, I wish to conduct statistical analyses with Phyloseq and investigate whether there are differences in the bacterial community composition of my samples exposed to different treatments.
I am concerned about how to deal with OTUs having an unassigned taxonomy. Indeed, among the 16’393 obtained genera, 6586 are “NA”, 2233 are “D_5__uncultured”, 705 are "D_5__uncultured bacterium” and 1044 are just “D_5__”. Altogether, there are therefore 64.5% of unassigned taxa at the genus level! Many of these (especially the NA) are also unassigned at higher taxonomic levels.
I seems to me that aggregating NAs into a same category is incorrect as it might group OTUs that are actually not closely related. On the other hand, the amount of unassigned OTUs is so important that they shouldn’t be discarded…
May I please ask for your advice on how to deal with this issue?
(I classified my sequences using the SILVA 128 release at 99%).
I think this detail of taxonomy assignment is common for 16S OTUs. One of the issues we face is that some species will have very different 16s genes, so you can totally classify them to the species level, while other species will have the exact same DNA in the region of the 16s gene you sequenced, so they all look the same based only on their 16s amplicon. In this second case, getting a species level assignment would be impossible using the 16s gene alone.
You could also be limited by your database. If a species has a unique 16s amplicon, that won't be identified if only related genera are in the database you are searching. SILVA is great, but no database is fully complete.
Well said. Discarding and merging are both imperfect.
I just keep all my unassigned OTUs and report their lowest taxonomic depth. Neither Qiime nor Phyloseq cares if the OTUs say NA, and they can both do OTU level analysis and report any taxonomy you have. Take a look at this DESeq2 analysis in Phyloseq in which many NA OTUs are reported without issue. https://joey711.github.io/phyloseq-extensions/DESeq2.html
I know that's not really a 'solution' to this problem, but I hope this helps you continue with confidence.
Colin
The ambiguous "D_5__" assignment is indeed common (that query seq matches a sequence in the reference database that has no annotation), and assignment to "uncultured" organisms in the database is also unavoidable (unless if you remove those sequences from your reference database — however, there probably is some other taxonomic information, e.g., a family-level annotation, so it is not as though these species are entirely unknown).
However, depending on your sample types, the number of NA classifications seems exceptionally high (I am assuming that you mean that these received no classification even at kingdom level, not that they were unassigned at genus level). That's around 1/3 of your sample unclassified — and for some sample types may make a lot of sense, but for well-described sample types the number of totally unclassified sequences should be much lower. If these NAs are just at genus level, that makes sense — many clades are difficult to resolve even at genus level (and especially depending on read length, etc).
In QIIME2, at least (e.g., with the q2-composition plugin), it is possible to look at differentially abundant features at two different levels: as sequence variants/OTUs, or as taxonomic annotations. It sounds like you are currently doing the latter — collapsing your features into genus-level taxa and looking for differentially abundant genera. You could also just determine whether the individual features are differentially abundant, e.g., with ANCOM (this is the method in q2-composition). It decreases statistical power since you are adjusting for many more tests, but it would provide a little more granularity in your analysis. The other benefit is that you don't even need to know/care what the taxonomic classification is until you determine whether that feature is significant; if it is significant but the taxonomic classification is ambiguous, you can take a closer look (e.g., by reclassifying with a different classifier or even using NCBI blast for a second opinion).
@colinbrislawn makes a great point here, and this relates back to my comment about sample type. You are always constrained by the reference database in two ways: (1) if your species is not present in the reference database, you will not get a confident/high-quality classification (and worse yet, you may get a misclassified sequence). (2) if the database contains a lot of uninformatively annotated reference sequences (e.g., "uncultured bacterium" without any other information), then you will get a lot of junk classifications. These are both problems with any and all publicly available 16S database that I am aware of!
Thank you very much for having taken the time to answer so quickly and for your very interesting and informative comments and advice. It is really of a great help!
That is true, no database is perfect and it is sometimes an important limitation for our analyses…
@colinbrislawn, thank you very much for the link to DESeq2. May I just ask a question about reporting OTUs at the lowest taxonomic depth? For example, if I have the following OTUs at the genus level:
Both could be listed as “Unidentified Gracilibacteria”. However, since the sequences were classified in these two different groups, it would be incorrect to merge them together under a single category for the statistics, right? Should we keep the two groups separated when doing statistical tests but maybe pool them together when representing abundance graphs?
@Nicholas_Bokulich, sorry, I may have used wrong words when reporting the numbers of unassigned taxa I obtained: by “NA” I meant the taxa that had no annotation at the genus level (such as “D_0__Bacteria;D_1__Proteobacteria;D_2__Alphaproteobacteria;;;__”). At higher taxonomic ranks, only 0.25% of my sequences have no assigned phylum for example.
Thank you also very much for the suggestion to try the q2-composition plug-in by using sequence variants instead of taxonomic annotations for the statistical analysis. I had indeed been trying the later. I will definitely explore the possibility of using ANCOM!
It's a tough question, and I think the short answer is that for the sake of simplicity I would just keep these separate for statistics and probably plotting (unless if you are working with a short list of taxa for either). (but @colinbrislawn and others may have different opinions on this!)
Long answer: In this specific case, it probably makes sense to merge, but I would personally not bother too much with such details unless if these taxa are of particular interest to you. it is impossible to tell at face value if these annotations are meant to differentiate two groups (probably not) and if you are picking through a long list of taxonomic results you would just be making too many arbitrary assumptions. You could examine the sequences associated with these two different annotations to see if they really do represent two distinct groups, but it's probably not worth the effort unless if this group is important to your hypothesis. With such a vague annotation, I do not think it is worth the trouble.
Excellent... that sounds much more typical! You had me worried for a moment that there was a technical issue we needed to iron out. Thanks for clarifying.
This is exactly what I do. For a stat test program like DESeq2 I keep all my OTUs / features as they are and report any taxonomy I have. But when I'm making a bar plot I use tax_glom() to keep is simple for the readers.
Keep up the good work and the great questions!
Colin
Thank you very much again for having taken the time to answer in details and for your explanations. I feel now much more confident on how to proceed and I have certainly learnt a lot.