Comparison of OTU Identifications with Silva 138.1 and GreenGenes 2022.10 databases

I have been using the bioinformatics services provided by Novogene. We use the most cost-efficient option, which still uses QIIME and the Silva 138.1 database.

I built my own classifier from the GreenGenes 2022.10 backbone files after extracting the V3V4 region. It seems to identify sequences well on the tutorial datasets.

I reclassified the OTUs provided by Novogene with this classifier. I expected some differences, but most of the OTUs have different identifications, and many of the differences are genuine conflicts rather than the same lineage resolved to different taxonomic depths.
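To separate agreement at a shallower rank from a real conflict, a comparison along these lines could be used. This is only a minimal sketch: the function names `parse_lineage` and `compare` are hypothetical, the example lineage strings are made up, and it assumes both tables store lineages as semicolon-separated strings with optional rank prefixes such as `p__`.

```python
def parse_lineage(lineage):
    """Split 'd__Bacteria; p__Firmicutes; ...' into a list of bare names."""
    names = []
    for field in lineage.split(";"):
        field = field.strip()
        # Drop a rank prefix such as 'p__' if one is present.
        if "__" in field:
            field = field.split("__", 1)[1]
        # Stop at the first empty or unassigned rank.
        if not field or field.lower() in {"unassigned", "unclassified"}:
            break
        names.append(field)
    return names


def compare(lineage_a, lineage_b):
    """Return (status, depth) for two lineage strings.

    status is 'identical', 'consistent' (one lineage is a prefix of the
    other), or 'conflict'. depth is the 0-based index of the first rank
    that disagrees, or None when there is no conflict.
    """
    a, b = parse_lineage(lineage_a), parse_lineage(lineage_b)
    for depth, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return "conflict", depth
    return ("identical" if len(a) == len(b) else "consistent"), None


# Made-up lineages, for illustration only.
shallow = "k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales"
deeper = "d__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__; s__"
other = "d__Bacteria; p__Bacteroidota; c__Bacteroidia"

print(compare(shallow, deeper))
print(compare(shallow, other))
```

Note that an entirely unassigned OTU parses to an empty lineage, so this sketch reports it as "consistent" with anything; those rows may be worth counting separately.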

I have attached the two taxonomy tables.

I chose to switch to GreenGenes because of the warning about species identification with Silva on the QIIME 2 documentation page.
feature.tax_assignments.txt (3.5 KB)
reclassified_features.txt (3.4 KB)

Has anyone else seen similar results? Which database do you prefer?

Hi @Bonita_Mc,

This is not surprising, and has much to do with how the reference databases are curated. There are many approaches and issues to deal with as described here:

It may come down to which of these classifications makes sense given your study system. This is why we provide multiple avenues to classify sequences. You can also try the SILVA weighted classifiers.

In most cases, you are lucky to obtain a true species-level designation with short read data. See here:

-Mike

Thanks for the paper links.

I was mostly surprised by how many were assigned to different phyla and classes. I assumed these high-level classifications would be pretty similar between databases.
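One way to put numbers on this is to tally, rank by rank, how often the two tables agree. The sketch below is hypothetical throughout: the helper names, the rank order, and the tiny example tables are all made up. It also includes an optional step that strips a trailing suffix such as `_A` from a name before comparing, on the assumption (which would need checking against the real tables) that some apparent phylum-level differences are naming conventions rather than different taxa.

```python
import re
from collections import Counter

RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def rank_names(lineage):
    """Return one bare name per rank; empty string where nothing is assigned."""
    names = []
    for field in lineage.split(";"):
        field = field.strip()
        if "__" in field:
            field = field.split("__", 1)[1]
        if field.lower() in {"unassigned", "unclassified"}:
            field = ""
        names.append(field)
    return names


def strip_suffix(name):
    """Assumption: a trailing '_A', '_B', ... is a placeholder suffix."""
    return re.sub(r"_[A-Z]+$", "", name)


def tally_by_rank(table_a, table_b, normalise=False):
    """Count, per rank, shared OTUs whose names agree or differ."""
    counts = {rank: Counter() for rank in RANKS}
    for otu in table_a.keys() & table_b.keys():
        names_a, names_b = rank_names(table_a[otu]), rank_names(table_b[otu])
        for rank, x, y in zip(RANKS, names_a, names_b):
            if not x or not y:
                break  # one table stops here, so there is nothing to compare
            if normalise:
                x, y = strip_suffix(x), strip_suffix(y)
            counts[rank]["agree" if x == y else "differ"] += 1
    return counts


# Made-up example tables: OTU id -> lineage string.
table_one = {
    "OTU_1": "k__Bacteria;p__Firmicutes;c__Clostridia",
    "OTU_2": "k__Bacteria;p__Bacteroidota;c__Bacteroidia",
}
table_two = {
    "OTU_1": "d__Bacteria; p__Firmicutes_A; c__Clostridia",
    "OTU_2": "d__Bacteria; p__Bacteroidota; c__Bacteroidia",
}

raw = tally_by_rank(table_one, table_two)
normalised = tally_by_rank(table_one, table_two, normalise=True)
for rank in ("phylum", "class"):
    print(rank, dict(raw[rank]), dict(normalised[rank]))
```

Comparing the raw and normalised tallies gives a rough idea of how much of the disagreement is down to naming and how much reflects genuinely different placements.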

I understand that the species classifications are not 100% accurate, but my boss really prioritizes them, so I want to maximize the species identifications.