Poor taxonomy classification on species level

Nicholas_Bokulich · May 13, 2022, 6:59am

The V4 region of 16S cannot reliably differentiate most species, so this result is expected. At genus level, 85-90% classification is also normal. This is because the short hypervariable regions carry much less information than full-length 16S. For more discussion and data on this, see the previous work on the topic, starting with RDP's benchmarks and then our own benchmarks for the classifiers used in QIIME 2:
https://journals.asm.org/doi/10.1128/AEM.00062-07#F1

https://www.nature.com/articles/s41467-019-12669-6

Note that SILVA does not curate its taxonomy at species level; it takes the information directly from NCBI. As a result, a very large proportion of species labels are missing or do not match the genus label (for data and more discussion of this and related topics, see the next paper linked below). This is also why you see annotations with strain labels like this, because that is the raw label from the GenBank accession:

Quick note on Akkermansia: this is a good example of how database "noise" does create issues (more on this below), but Akkermansia is also a somewhat poor representative of the more populated clades in the database, since the genus only has a couple of species (and I think only A. muciniphila is in SILVA, if I recall correctly). Other genera have genuine problems resolving at species level because their species have very similar or identical 16S sequences (especially when looking at only short amplicons).
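To make the "identical 16S over a short amplicon" point concrete, here is a minimal Python sketch. The species names, sequences, and window coordinates are all made up for illustration (they are not real 16S data or real V4 positions), and the helper name is hypothetical. It just groups reference sequences by the sub-region you would amplify and reports the regions shared by more than one species.

```python
from collections import defaultdict

# Hypothetical toy references: (species label, "full-length" sequence).
# These are invented strings, not real 16S sequences.
REFERENCES = [
    ("Genus_a species_1", "ACGTACGTTTGACCA"),
    ("Genus_a species_2", "ACGTACGTTTGACCG"),
    ("Genus_b species_3", "ACGAACGTATGACCA"),
]


def species_sharing_amplicon(references, start, end):
    """Return {amplicon: [species, ...]} for amplicons found in >1 species."""
    groups = defaultdict(set)
    for species, sequence in references:
        # Slice out the region that would be sequenced.
        groups[sequence[start:end]].add(species)
    return {amp: sorted(names) for amp, names in groups.items() if len(names) > 1}


# Short window: species_1 and species_2 become indistinguishable.
print(species_sharing_amplicon(REFERENCES, 0, 8))
# Full length: every species is unique, so nothing is reported.
print(species_sharing_amplicon(REFERENCES, 0, 15))
```

No classifier can separate two species whose sequences are identical over the sequenced region, which is the hard ceiling on species-level resolution discussed here.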

Yes. We released a QIIME 2 plugin a little while ago that will let you create such databases and perform various filtering steps (and record all of these edits in QIIME 2's provenance for full traceability of what you did, so you can easily do it again). This includes a method to edit-taxonomy, which you could use with a regular expression to trim off strain information, or to filter-taxa that are missing information at specific ranks, etc. You can find a tutorial for using this with SILVA here:

We used RESCRIPt to do quite a few database and filtering benchmarks for various marker genes and databases. You can see that these edits lead to ~75% classification accuracy at species level with SILVA. Removing redundant, confusing, misannotated, or unannotated entries from the database really does help, and that was one original motivation for this plugin.
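To give a feel for what this kind of cleanup does, here is a minimal sketch in plain Python. It is not RESCRIPt's implementation or its command-line interface; the lineage strings, the rank prefixes, the regular expression, and the helper names are all assumptions made up for illustration. It trims anything after the binomial in the species label and drops entries that have no species label at all.

```python
import re

# Hypothetical lineage strings, invented for illustration only.
TAXONOMY = {
    "seq1": "d__Bacteria; g__Akkermansia; s__Akkermansia_muciniphila_ATCC_BAA-835",
    "seq2": "d__Bacteria; g__Lactobacillus; s__uncultured_bacterium",
    "seq3": "d__Bacteria; g__Escherichia-Shigella",
}

# Keep "s__Genus_species" and discard any further "_strain..." suffix.
STRAIN_SUFFIX = re.compile(r"(s__[^_;]+_[^_;]+)_[^;]*")


def trim_strain(lineage):
    """Strip strain information that follows the binomial species name."""
    return STRAIN_SUFFIX.sub(r"\1", lineage)


def has_rank(lineage, prefix):
    """True if the lineage has a non-empty label at the given rank prefix."""
    fields = [field.strip() for field in lineage.split(";")]
    return any(f.startswith(prefix) and len(f) > len(prefix) for f in fields)


# Drop entries without a species label, then trim strain info from the rest.
cleaned = {
    seq_id: trim_strain(lineage)
    for seq_id, lineage in TAXONOMY.items()
    if has_rank(lineage, "s__")
}

for seq_id, lineage in cleaned.items():
    print(seq_id, lineage)
```

In practice you would probably also want to decide what to do with uninformative labels like the "uncultured" one above, which this sketch deliberately leaves untouched.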

This is based on cross-validated classification of the known reference sequences, so the 75% classification accuracy here should not be interpreted as "this method is only accurate 75% of the time". Rather, this is telling you how much species-level resolution you could possibly expect when studying a community with mostly well-characterized microbial composition. Mileage may vary, as some systems contain many more uncharacterized microbes than others (and hence may be poorly represented by a given database). QIIME 2's classifiers account for this by using confidence thresholds to prevent the classifier from giving a species-level (or genus-level) classification when there are multiple possible hits at that level. So the low number of species-level hits in a real dataset can certainly be related to noisy databases or underperforming methods, but at least 20-25% of the lack of resolution is due to the inability to resolve individual species using very short DNA segments.
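Here is a rough Python sketch of how a confidence cutoff stops a classifier from over-committing when several species are plausible. This is a simplified illustration, not the actual code in QIIME 2's classifiers; the function name, the lineages, the probabilities, and the 0.7 cutoff are all assumptions chosen for the example. The idea is to start from the most probable full lineage and back off one rank at a time until the summed probability of everything sharing that prefix reaches the cutoff.

```python
def classify_with_confidence(posteriors, confidence=0.7):
    """Truncate the best lineage to the deepest rank meeting the cutoff.

    posteriors maps a full lineage (tuple of rank labels) to its probability.
    Returns (truncated_lineage, support); an empty tuple means unassigned.
    """
    best = max(posteriors, key=posteriors.get)
    for depth in range(len(best), 0, -1):
        prefix = best[:depth]
        # Pool the probability of every candidate that agrees at this depth.
        support = sum(
            prob for lineage, prob in posteriors.items()
            if lineage[:depth] == prefix
        )
        if support >= confidence:
            return prefix, support
    return (), 0.0


# Hypothetical posteriors: two species of the same genus are nearly tied.
posteriors = {
    ("Bacteria", "GenusA", "species_1"): 0.50,
    ("Bacteria", "GenusA", "species_2"): 0.45,
    ("Bacteria", "GenusB", "species_3"): 0.05,
}

lineage, support = classify_with_confidence(posteriors, confidence=0.7)
print(lineage)  # stops at genus: neither species alone reaches the cutoff
```

Lowering the cutoff would give you more species-level labels, but only by accepting calls that are close to a coin flip between near-identical references.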

But there are other approaches that can improve this, even up to around 90% classification accuracy at species level... see the benchmarks and papers above for much more description of the various methodological approaches and limitations that we have explored.

Good luck!