Poor taxonomy classification on species level


I'm using SILVA 138.1 to construct taxonomy for biome 16S analysis. I got NR99 version and used some additional filters ( remove uncultured and unidentified taxons, also taxons which not represented at species level )

I used next commands to create classificator on V4 regions

qiime feature-classifier extract-reads  --i-sequences Silva_NR99_filter.qza  --p-f-primer GTGCCAGCMGCCGCGGTAA --p-r-primer GGACTACHVGGGTWTCTAAT  --p-trunc-len 250 --p-min-length 50 --p-max-length 450 --o-reads Silva_NR99_filter_V4.qza
qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads Silva_NR99_filter_V4.qza --i-reference-taxonomy Silva_tax.qza --o-classifier classifier_Silva_V4.qza

then after running qiime2 pipeline I got not very good results. At species level I have 186/341 classified (54%). But on genera level it much better - 170/200 (85%) - but also should be a little better on practice

I tried to undetstand why I have such bad results on species level. Particularly I considered "Akkermansia" genera. So in fact my pipeline didn't recognize "Akkermansia muciniphila" but recognize only "Akkermansia".

I tried to keep only one strain "Akkermansia muciniphila" from Silva 138.1 Db. With such settings pipeline was done fine and I got "Akkermansia muciniphila" in results. But when I kept both "Akkermansia muciniphila" and "Akkermansia muciniphila ATCC BAA-835", the classification gave me only "Akkermansia" genera.

So it seems that such recognition problem is related to redundancy in some sence in sequences used for taxonomy. Also it looks that there could be some solution for reduce this redundancy - for example like removing all species except "Akkermansia muciniphila" - because "Akkermansia muciniphila" in results instead "Akkermansia muciniphila ATCC BAA-835" is much better than just "Akkermansia" on genera level

Do you have some suggestions how could such results be improved? IMO improving to 75-80% recognition on species level will be adequate result

Thank you for your attention : )

Hi @biojack ,

The V4 region of 16S cannot reliably differentiate most species, so this result is quite expected. Also at genus level 85-90% classification is normal. This is because of the limited information content of the short hypervariable regions vs. full 16S. For more prose and data on this, see the previous work on this starting with RDP's benchmarks, and then our own many benchmarks for the classifiers used in QIIME 2:


Note that SILVA does not curate their taxonomy at species level. They take the info directly from NCBI. So a very large proportion are missing or do not match the genus label (for data and more discussion of this and related topics, see the next paper linked below). This is also why you see annotations with strain labels like this, because that is the raw label from the GenBank accession:

Quick note on Akkermansia: this is a good example of how database "noise" does create issues (more on this below), but Akkermansia is also a bit of a poor representative for other more populated clades in the database, since the genus only have a couple species (and I think only A. muciniphila is in SILVA if I recall correctly). Other genera have genuine problems resolving at species level because those species have very similar or identical 16S sequences (esp. when looking at only short amplicons).

Yes. We released a QIIME 2 plugin a little while ago that will let you create such databases and perform various filtering steps (and record all of these edits in QIIME 2's provenance for full traceability of what you did, so you can easily do it again :grin: ). This includes a method to edit-taxonomy, which you could use with a regular expression to trim off strain information... or filter-taxa that are missing information at specific ranks, etc. You can find a tutorial for using this with SILVA here:

We used RESCRIPt to do quite a few database and filtering benchmarks for various marker genes and dbs. You can see that these edits lead to ~75% classification accuracy at species level with SILVA. Removing redundant/confusing/mis/unannotated entries from the database really does help, and that was one original motivation for this plugin.

This is based on cross-validated classification of the known reference sequences, so the 75% classification accuracy here should not be interpreted as "this method is only accuracy 75% of the time". Rather, this is telling you how much species-level resolution you could possibly expect when studying a community with mostly well-characterized microbial composition. Mileage may vary, as some systems contain many more uncharacterized microbes than others (and hence may be poorly represented by a given database). QIIME 2's classifiers account for this by using confidence intervals to prevent the classifier from giving a species (or genus)-level classification when there are multiple possible hits at that level. So the low number of species-level hits in a real database can be related to noisy databases or underperforming methods for sure, but at least 20-25% of the lack of resolution is due to the inability to resolve individual species using very short DNA segments.

But there are other approaches that can improve this, even up to around 90% classification accuracy at species level... see the benchmarks and papers above for mcuh more description of the various methodological approaches and limitations that we have explored.

Good luck!


Thank you @Nicholas_Bokulich very much !

I know about RESCRIPt, I tried once to use id dereplication step, but it was not give notable effect. But I will try to learn it more preciesly.

I will try your suggestions and return with new results

Also, if Silva is not curated species level taxonomy - would be better to use some another database for species recognition - GTDB or smth else?

yes the impact of dereplication on accuracy is negligible, but it should improve performance (speed). Other steps, e.g., using RESCRIPt to edit the taxonomy to remove strain labels, would help a little more for specific clades.

Yes, but at a cost.

GTDB is more computer-digestible (i.e., better ability to resolve clades) in the sense that they have already worked to disentangle polyphyletic clades. But this is by introducing unofficial taxonomic labels (bypassing the usual formal steps for naming new clades). So it comes with theoretical and practical costs: at the very least you (and collaborators, etc) might find it hard to interpret the results and compare to the literature using other (e.g., the official) taxonomic naming schemes. I like GTDB but it gives some people indigestion, so these are costs to weigh.

Also, GTDB does not totally "solve" this issue, because V4 is still a short sub-region of the 16S with limited information content, so you can still only resolve species 75-80% of the time (even in GTDB). If you really want something radically different, see the third paper I linked above (Kaehler et al. 2019), this brings the accuracy rate to around 90% at species level, but it is a more complicated approach.

If you want some examples, you can see those benchmarks in the RESCRIPt paper, comparing GTDB, SILVA, NCBI Refseqs, and Greengenes (on both simulated and real data).