Building a database in order to train a classifier

bugsinbugs · May 22, 2020, 9:30pm

I have amplicon datasets generated for a single-copy gene (460 bp) within a specific gut-related taxon (genus level) that I want to look through for species and strain variants. I have some questions about building the database to do this. I'm mostly wondering what range of sequence similarities I should use in the alignment. For example, I expect most of the sequences to be >90% similar, with species clusters at 95-97%, and I'll bet the strain variants I want are >99% similar. Should I use a bunch of highly similar (>99%) sequences within each of the expected species? What about outgroups, since there may be stuff in there I can't account for? Are there disadvantages to having this kind of discontinuous similarity in the database? Thanks!!!

Nicholas_Bokulich · May 29, 2020, 5:24pm

Welcome @bugsinbugs!

I am not sure I have a real answer to your question but here's my advice:

Yes, include many highly similar (>99%) sequences within each expected species. If that's the amount of similarity you expect and the degree of resolution you are interested in, get as much differentiation as possible.

For outgroups: if you are not interested in fully profiling them, I recommend putting in some of those sequences but maybe not all representatives. For example, for fungal ITS classification it can be a good idea to throw in a few plant ITS sequences as outgroups (plant ITS seqs are amplified by most fungal ITS primers, but most people using those primers only care about the fungi). Any plant ITS will be classified as belonging to those seqs, even if it is not the correct plant species, but since any plant hits are thrown away, that level of specificity is not important.

I don't think the discontinuous similarity is a disadvantage, as long as you are documenting and accounting for those caveats in your analysis (e.g., you expect good results on the well-covered clades, but you should always throw out anything classified to the outgroups).
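
To make the "throw out anything classified to the outgroups" step concrete, here is a minimal Python sketch. It assumes your classifications are available as a plain mapping from feature ID to a taxonomy string; the `OUTGROUP_PREFIXES` value, the function name, and the toy labels are all hypothetical and should be swapped for whatever lineages you actually added as outgroups.

```python
# Hypothetical outgroup labels: replace with the lineages you added as outgroups.
OUTGROUP_PREFIXES = ("k__Plantae",)


def drop_outgroup_hits(assignments, outgroup_prefixes=OUTGROUP_PREFIXES):
    """Split {feature_id: taxonomy_string} into (kept, discarded) dicts.

    A feature is discarded when its taxonomy string starts with any of the
    outgroup prefixes; everything else is kept for downstream analysis.
    """
    prefixes = tuple(outgroup_prefixes)
    kept, discarded = {}, {}
    for feature_id, taxon in assignments.items():
        target = discarded if taxon.startswith(prefixes) else kept
        target[feature_id] = taxon
    return kept, discarded


# Toy assignments, made up for illustration only.
assignments = {
    "seq1": "k__Fungi; p__Ascomycota; g__Candida",
    "seq2": "k__Plantae; p__Streptophyta; g__Zea",
    "seq3": "k__Fungi; p__Basidiomycota; g__Malassezia",
}

kept, discarded = drop_outgroup_hits(assignments)
print("kept:", sorted(kept))
print("discarded:", sorted(discarded))
```

Keeping the discarded set around (rather than silently dropping it) makes it easy to report how much of each sample landed in the outgroups.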

But ultimately you should thoroughly validate this using appropriate test data!
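
One way to look at validation results, sketched in Python under some assumptions: you have test sequences with known labels and the labels your classifier predicted for them, both as plain dicts. The `per_group_accuracy` helper, the `species;strain` label format, and the toy data are hypothetical. The idea is just to score accuracy per species group, so that poorly covered clades stand out.

```python
from collections import defaultdict


def per_group_accuracy(expected, predicted, group_of):
    """Return {group: fraction of test sequences classified correctly}.

    expected / predicted map sequence ID -> label; group_of maps a true
    label to the group (e.g. species) it should be scored under.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for seq_id, true_label in expected.items():
        group = group_of(true_label)
        total[group] += 1
        # A missing prediction counts as incorrect.
        if predicted.get(seq_id) == true_label:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Toy labels in a hypothetical "species;strain" format.
expected = {
    "t1": "sp_A;strain_1",
    "t2": "sp_A;strain_2",
    "t3": "sp_B;strain_1",
    "t4": "sp_B;strain_2",
}
predicted = {
    "t1": "sp_A;strain_1",
    "t2": "sp_A;strain_2",
    "t3": "sp_B;strain_1",
    "t4": "sp_B;strain_1",  # wrong strain within the right species
}

accuracy = per_group_accuracy(
    expected, predicted, group_of=lambda label: label.split(";")[0]
)
for group, value in sorted(accuracy.items()):
    print(group, value)
```

Scoring per group rather than overall is a deliberate choice here: a single overall number can hide the fact that strain-level calls only work in the clades with many closely related reference sequences.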

Good luck!

bugsinbugs · June 2, 2020, 2:29pm

Thanks for the advice. The process seems to work, though I have really great resolution within some of the strain groups (those for which I was able to provide many closely related seqs) and not so great in others. I guess this is to be expected and will only get better as the databases get more comprehensive.