Comparison of OTU Identifications with Silva 138.1 and Greengenes 2022.10 databases

I have been using the bioinformatics services provided by Novogene. We use the most cost-efficient option, which still uses QIIME and the Silva 138.1 database.

I built my own classifier from the Greengenes 2022.10 backbone files after extracting the V3V4 region. It seems to classify sequences well on the tutorial datasets.

I reclassified the OTUs provided by Novogene with this classifier. I was expecting some differences, but most of the OTUs have different identifications, and many of the differences are not simply the same lineage resolved to a different taxonomic level.

I have attached the 2 taxonomy tables.
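For anyone who wants to make the same comparison on their own tables, here is a minimal sketch of how the two kinds of difference could be separated: lineages that agree but stop at different depths, versus lineages that truly conflict. It assumes semicolon-separated taxonomy strings with rank prefixes such as `p__`; the function names and the `RANKS` list are hypothetical, not part of QIIME.

```python
# Rank order assumed for both tables (hypothetical helper constant).
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def parse_lineage(lineage):
    """Split 'd__Bacteria; p__Firmicutes; ...' into a list of taxon names.

    Parsing stops at the first empty or unclassified rank, so trailing
    placeholders such as 'g__; s__' are ignored.
    """
    names = []
    for field in lineage.split(";"):
        field = field.strip()
        if not field:
            continue
        # Drop the rank prefix ("p__", "g__", ...) when one is present.
        name = field.split("__", 1)[1] if "__" in field else field
        if name == "" or name.lower() in {"unclassified", "unassigned"}:
            break
        names.append(name)
    return names


def compare(lineage_a, lineage_b):
    """Classify a pair of lineages as identical, consistent, or conflicting."""
    a, b = parse_lineage(lineage_a), parse_lineage(lineage_b)
    # Walk down the ranks both lineages share and report the first mismatch.
    for rank, x, y in zip(RANKS, a, b):
        if x != y:
            return f"conflict at {rank}"
    if len(a) == len(b):
        return "identical"
    return "consistent, different depth"


print(compare("k__Bacteria; p__Firmicutes; c__Bacilli",
              "d__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales"))
print(compare("k__Bacteria; p__Firmicutes",
              "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria"))
```

This compares names as plain strings, so a taxon that was only renamed between databases will still be reported as a conflict.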

I chose to switch to Greengenes because of the warning about species identification with Silva on the QIIME 2 documentation page.
feature.tax_assignments.txt (3.5 KB)
reclassified_features.txt (3.4 KB)

Has anyone else seen similar results? Which database do you prefer?

Hi @Bonita_Mc,

This is not surprising, and has much to do with how the reference databases are curated. There are many approaches and issues to deal with as described here:

It may come down to which of these classifications make sense given your study system. This is why we provide multiple avenues to classify sequences. You can also try the SILVA weighted classifiers.

In most cases, you are lucky to obtain a true species-level designation with short read data. See here:

-Mike

Thanks for the paper links.

I was mostly surprised by how many were assigned to different phyla and classes. I assumed these high-level classifications would be pretty similar between databases.

I understand that the species classifications are not 100% accurate, but my boss really prioritizes them, so I want to maximize the species identifications.

@Bonita_Mc, is this perhaps driven by revisions in the nomenclature? We've previously observed high correlation with SILVA.

Best,
Daniel
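One way to check whether a mismatch is only a naming revision is to map both labels to a common name before comparing. The sketch below is a rough illustration: the synonym table is a small, incomplete example of phylum renamings and should be checked against each database's release notes, and the suffix pattern (a letter suffix such as `_A`, optionally followed by a numeric ID) is an assumption about how placeholder labels are written. `PHYLUM_SYNONYMS`, `canonical_phylum`, and `same_phylum` are hypothetical names.

```python
import re

# Illustrative and incomplete: older or alternative phylum name -> newer name.
PHYLUM_SYNONYMS = {
    "Firmicutes": "Bacillota",
    "Proteobacteria": "Pseudomonadota",
    "Bacteroidetes": "Bacteroidota",
    "Actinobacteria": "Actinomycetota",
    "Actinobacteriota": "Actinomycetota",
}


def canonical_phylum(name):
    """Return a common label for a phylum name.

    Assumption: placeholder suffixes look like '_A' or '_A_12345' and can be
    stripped before looking the name up in the synonym table.
    """
    base = re.sub(r"(?:_[A-Z]+)?(?:_\d+)?$", "", name)
    return PHYLUM_SYNONYMS.get(base, base)


def same_phylum(a, b):
    """True when two labels map to the same canonical phylum."""
    return canonical_phylum(a) == canonical_phylum(b)


print(same_phylum("Firmicutes_A", "Bacillota"))
print(same_phylum("Firmicutes", "Proteobacteria"))
```

Any pair that still differs after this kind of normalization is more likely a real difference in assignment than a difference in naming.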

The dataset is only 32 OTUs.

One differs because of a change in taxonomy, but as far as I can tell the others are simply being assigned to different phyla.

I can go recount, but I think 4 of the 32 are in a different phylum or class. The others are the more minor differences I was expecting.

Hi @Bonita_Mc,

I am assuming that the feature.tax_assignments.txt file is the one from SILVA? If so, when / how was the SILVA reference file made? Or are these taxonomic assignments from Novogene? I ask because when we generate SILVA reference files, we use "domain" / "d__" and not "kingdom" / "k__".
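If the two tables differ only in the top-level prefix, a small normalization step avoids counting that as a disagreement. A minimal sketch, assuming semicolon-separated lineage strings; `normalize_domain_prefix` is a hypothetical helper, not a QIIME function:

```python
def normalize_domain_prefix(lineage):
    """Rewrite a leading 'k__' (kingdom) prefix as 'd__' (domain).

    Only the first field is touched; all other ranks are passed through.
    """
    fields = [f.strip() for f in lineage.split(";") if f.strip()]
    if fields and fields[0].startswith("k__"):
        fields[0] = "d__" + fields[0][len("k__"):]
    return "; ".join(fields)


print(normalize_domain_prefix("k__Bacteria;p__Firmicutes;c__Bacilli"))
```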

SILVA is also in the process of updating its taxonomy to be more similar to Greengenes 2 in some cases; e.g., as of SILVA 138.2, they've updated some phylum and genus labels. Perhaps try making your own SILVA reference database as outlined here, and re-classify? Keep in mind that many facilities curate their own SILVA (or other) database differently, and these may not even be equivalent to each other, leading to different assignments. You do not need to follow all the steps in the tutorial. Much of it is just showing off all the things you can do.

I am assuming that the feature.tax_assignments.txt file is the one from SILVA? If so, when / how was the SILVA reference file made? Or are these taxonomic assignments from Novogene?

Yes, those are the assignments from Novogene. They use QIIME and the Silva 138.1 database.

Good to know. Thanks!

In addition to the caveats I presented earlier, another thing I forgot to mention: the resulting classification also depends on which algorithm they used with the reference database to classify your reads. Did they use BLAST, VSEARCH, naive Bayes, or another algorithm? Using different algorithms on the same or similarly curated reference databases can also produce some differences.