I have been using the bioinformatics services provided by Novogene. We use the most cost-efficient option, which still uses QIIME and the SILVA 138.1 database.
I have made my own classifier using the Greengenes 2022.10 backbone files and extracting the V3V4 region. This seems to identify sequences well using the tutorial datasets.
I reclassified the OTUs provided by Novogene with this classifier. I was expecting some differences, but most of the OTUs have different identifications, and many are not simply the same lineage resolved to different taxonomic levels.
This is not surprising, and has much to do with how the reference databases are curated. There are many approaches and issues to deal with as described here:
It may come down to which of these classifications makes sense given your study system. This is why we provide multiple avenues to classify sequences. You can also try the SILVA weighted classifiers.
In most cases, you are lucky to obtain a true species-level designation with short read data. See here:
I was mostly surprised by how many were assigned to different phyla and classes. I assumed these high-level classifications would be pretty similar between databases.
I understand that the species classifications are not 100% accurate, but my boss really prioritizes them, so I want to maximize the species identifications.
I am assuming that the feature.tax_assignments.txt file is the one from SILVA? If so, when / how was the SILVA reference file made? Or are these taxonomic assignments from Novogene? I ask because when we generate SILVA reference files, we use "domain" / "d__" and not "kingdom" / "k__".
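A quick way to check which top-rank prefix a taxonomy file uses is to count the prefixes directly. This is only a minimal sketch: it assumes a tab-separated file with the feature ID in the first column and the taxonomy string in the second, which may not match the actual layout of feature.tax_assignments.txt. The function name `top_rank_prefixes` and the example rows are hypothetical.

```python
import csv
import io
from collections import Counter


def top_rank_prefixes(lines):
    """Count the prefix (e.g. 'd__' or 'k__') on the first rank of each taxonomy string.

    Assumes tab-separated rows: feature ID, taxonomy string, then optional columns.
    A header row, if present, is counted under '(no prefix)'.
    """
    counts = Counter()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        first_rank = row[1].split(";")[0].strip()
        if "__" in first_rank:
            counts[first_rank.split("__", 1)[0] + "__"] += 1
        else:
            counts["(no prefix)"] += 1
    return counts


# Made-up rows for illustration only; replace with open("your_file.txt", newline="").
example = io.StringIO(
    "OTU_1\tk__Bacteria;p__Firmicutes;c__Bacilli\t0.98\n"
    "OTU_2\td__Bacteria;p__Proteobacteria\t0.91\n"
    "OTU_3\tUnassigned\t0.40\n"
)
prefix_counts = top_rank_prefixes(example)
print(prefix_counts)
```

If nearly every row starts with "k__", the reference taxonomy was probably formatted differently from the "d__" style described above.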
SILVA is also in the process of updating its taxonomy to be more similar to Greengenes 2 in some cases, e.g. as of SILVA 138.2, they've updated some phylum and genus labels. Perhaps try making your own SILVA reference database as outlined here, and re-classify? Keep in mind that many facilities curate their own SILVA (or other) database differently, and these curated versions may not even be equivalent to each other, leading to different assignments. You do not need to follow all the steps in the tutorial. Much of it is just showing off all the things you can do.
Yes, those are the assignments from Novogene. They use QIIME and the SILVA 138.1 database.
In addition to the caveats I presented earlier, another thing I forgot to mention: the resulting classification also depends on which algorithm they used, along with the reference database, to classify your reads. Did they use BLAST, vsearch, naive Bayes, or another algorithm? Using different algorithms on the same or similarly curated reference databases can also result in some differences.
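To see where two sets of assignments actually diverge, whether the cause is the database or the algorithm, it can help to compare them rank by rank and separate real conflicts from cases where one lineage is simply resolved deeper than the other. The sketch below is a rough illustration, not a QIIME 2 method: it assumes semicolon-separated lineages with "x__" prefixes, the helper names `parse_taxonomy` and `first_conflict` are hypothetical, and the lineage strings are made up. It compares names as plain text, so a renamed phylum counts as a conflict even when both labels refer to the same group.

```python
from collections import Counter

RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def parse_taxonomy(taxon):
    """Split 'd__Bacteria; p__X; ...' into names, dropping prefixes.

    Stops at the first empty rank (e.g. 'g__'), treating it as unclassified.
    """
    names = []
    for part in taxon.split(";"):
        part = part.strip()
        if "__" in part:
            part = part.split("__", 1)[1]
        if not part:
            break
        names.append(part)
    return names


def first_conflict(taxon_a, taxon_b):
    """Return the highest rank at which the two lineages disagree.

    Returns None when they agree at every rank they share, i.e. they are
    identical or differ only in how deep the classification goes.
    """
    a, b = parse_taxonomy(taxon_a), parse_taxonomy(taxon_b)
    for rank, name_a, name_b in zip(RANKS, a, b):
        if name_a != name_b:
            return rank
    return None


# Illustrative pairs only (assignment from classifier A, assignment from classifier B).
pairs = [
    ("k__Bacteria; p__Firmicutes; c__Bacilli",
     "d__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales"),
    ("k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
     "d__Bacteria; p__Pseudomonadota; c__Gammaproteobacteria"),
]
tally = Counter(first_conflict(a, b) or "agree" for a, b in pairs)
print(tally)
```

A tally dominated by phylum-level conflicts is more likely to reflect naming differences between the databases than genuinely different placements, so those cases are worth inspecting by hand.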