Why is NCBI not used for training classifiers?

Ellenphant · May 1, 2020, 6:24pm

So maybe I am misunderstanding, but from what I have seen people always use either Greengenes or SILVA. Is there a reason why people don't use NCBI for classifiers?

Also do people normally update their classifiers with every new release of SILVA?

Just trying to get a better idea on the logic behind classifier things!

Nicholas_Bokulich · May 1, 2020, 7:02pm

Hi @Ellenphant,
Great questions, here are my thoughts:

Some people do; you can search this forum for cases where people are using (or asking about using) NCBI sequences to make their own reference databases. Some things to consider:

It can be difficult to build your own reference database "from scratch" (e.g., from NCBI)! That's the main reason not to use NCBI for standard markers that have curated reference databases available (like 16S). NCBI does have some RefSeq sequences (at least for 16S, maybe others), but these are built from type sequences, so they may or may not represent uncultivated clades, and they will still present some challenges to reformat.
Many of the curated reference databases are built from curated NCBI sequences, so their maintainers have already done the hard work of figuring out what to include or exclude from those data, as well as drawing from other sources. The SILVA, Greengenes, and GTDB papers explain the reasons for using reference databases like these.
Using a standard reference database makes your work a lot easier to reproduce: there is less guesswork about what was done to format a collection of sequences, and it is easier for others (or yourself!) to reproduce your findings or apply your methods to new data. Releases are versioned, so you can point to exactly which release was used.

So reproducibility and ease of use are the main reasons to use something "out of the box" (SILVA, Greengenes, GTDB, etc.) instead of grabbing sequences from NCBI GenBank or another sequence repository.
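To make the "point to which release was used" idea concrete, here is a minimal sketch of writing down the database name, release, and file checksums alongside an analysis. The function names (`reference_record`, `write_reference_record`) and the file names in the usage comment are hypothetical, made up for illustration; this is not part of QIIME 2 or of any database's tooling.

```python
import hashlib
import json
from pathlib import Path


def reference_record(name, release, files):
    """Describe a reference database by name, release, and file checksums."""
    checksums = {}
    for path in map(Path, files):
        # Hash the raw bytes so anyone can confirm they hold the same file.
        checksums[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"database": name, "release": release, "sha256": checksums}


def write_reference_record(record, out_path):
    """Save the record as JSON next to the analysis outputs."""
    Path(out_path).write_text(json.dumps(record, indent=2, sort_keys=True))


# Usage (hypothetical file names and release label):
# record = reference_record("SILVA", "138", ["ref-seqs.fasta", "ref-tax.tsv"])
# write_reference_record(record, "reference-database.json")
```

With a versioned release the record is mostly a formality; with a hand-built NCBI collection, the checksums are about the only way for someone else to confirm they have the same sequences.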

As for updating your classifier with every new release: you don't have to, but the new releases presumably are "better":

new sequences, updated taxonomies, maybe some garbage filtered out?
You'd need to check the release notes for individual databases to be sure,
but in general it is probably advantageous to train a new classifier from the latest database when a new release becomes available.

However, that probably does not mean that you "need" to. Some reasons not to:

The updates may be trivial and you don't want the trouble! See the release notes to decide.
A new release does not necessarily "invalidate" the old release (e.g., maybe new sequences are added, but these don't impact taxa in your environment). See the release notes to decide.
You want to compare new results to old (or maybe even published) results. You should use the same classifier for both, if possible, to make sure that methods are uniform (e.g., if you see a species in dataset 1 but not 2, is it due to differences in the samples or differences in the reference database?).
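The last point can be sketched as a small guard: only diff the taxa of two runs if they were classified against the same reference database and release. Everything here (`ClassifiedRun`, `taxa_only_in_first`, the field names) is a hypothetical illustration, not an existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassifiedRun:
    name: str         # label for the dataset
    reference: str    # reference database name
    release: str      # release of that database used to train the classifier
    taxa: frozenset   # taxa observed in this dataset


def taxa_only_in_first(a, b):
    """Return taxa seen in run `a` but not in run `b`, sorted by name.

    Refuses to compare runs classified against different references,
    because a difference could then come from the database, not the samples.
    """
    if (a.reference, a.release) != (b.reference, b.release):
        raise ValueError(
            f"{a.name} used {a.reference} {a.release} but "
            f"{b.name} used {b.reference} {b.release}; "
            "reclassify with the same classifier before comparing"
        )
    return sorted(a.taxa - b.taxa)
```

If the check fails, the simplest fix is usually to rerun classification on the older dataset with the classifier used for the newer one, or the other way around.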

Ellenphant · May 2, 2020, 9:05pm

Thank you so much! This clears up all of the uncertainty I was having and gives me a much better understanding of the logic behind it. Before, it always just seemed like a "well, we do it because we do it" sort of thing, but it's great knowing the reasoning.

Nicholas_Bokulich · May 4, 2020, 7:16pm

2 posts were split to a new topic: Which reference database to use: SILVA, Greengenes, GTDB