Building a database in order to train a classifier

Nicholas_Bokulich · May 29, 2020, 5:24pm

I am not sure I have a real answer to your question, but here's my advice:

Yes. If that's the amount of similarity you expect and the degree of resolution you are interested in, get as much differentiation as possible.

If you are not interested in fully profiling those outgroups, I recommend putting in some of those sequences but maybe not all representatives. For example, for fungal ITS classification it can be a good idea to throw in a few plant ITS sequences as outgroups (plant ITS seqs are amplified by most fungal ITS primers, but most people using those primers only care about the fungi). Any plant ITS read will then be classified to one of those outgroup seqs, even if that is not the correct plant species. Since any plant hits are thrown away anyway, that level of specificity is not important.
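To make the "some but not all representatives" idea concrete, here is a minimal sketch in plain Python. It assumes your reference is a list of (sequence ID, taxonomy string) pairs; the function name `build_reference`, the IDs, and the taxonomy labels are all made up for illustration and do not come from any particular database or tool.

```python
import random

def build_reference(records, outgroup_label, n_outgroup, seed=0):
    """Keep every ingroup record plus a random subsample of outgroup records.

    records: list of (seq_id, taxonomy) tuples (hypothetical format).
    outgroup_label: substring that marks a taxonomy string as outgroup.
    n_outgroup: maximum number of outgroup records to keep.
    """
    ingroup = [r for r in records if outgroup_label not in r[1]]
    outgroup = [r for r in records if outgroup_label in r[1]]
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    sampled = rng.sample(outgroup, min(n_outgroup, len(outgroup)))
    return ingroup + sampled

# Illustrative records only: two fungi (ingroup) and three plants (outgroup).
records = [
    ("f1", "k__Fungi;p__Ascomycota"),
    ("f2", "k__Fungi;p__Basidiomycota"),
    ("p1", "k__Viridiplantae;p__Streptophyta"),
    ("p2", "k__Viridiplantae;p__Streptophyta"),
    ("p3", "k__Viridiplantae;p__Chlorophyta"),
]
reference = build_reference(records, "k__Viridiplantae", n_outgroup=2)
```

How many outgroup sequences to keep is a judgment call; the point is only that the outgroup needs enough coverage to catch off-target reads, not full resolution.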

I think not, as long as you are documenting and accounting for those caveats in your analysis (e.g., you expect good results on the well-covered clades, but you should always throw out anything classified to the outgroups).
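Throwing out the outgroup hits could look something like the sketch below. It assumes the classifier output has been loaded into a dict of feature ID to taxonomy string; `drop_outgroup_hits`, the feature IDs, and the labels are hypothetical names for illustration, not part of any real tool.

```python
def drop_outgroup_hits(classifications, outgroup_prefixes):
    """Split classifications into kept (ingroup) and discarded (outgroup) hits.

    classifications: dict of feature_id -> taxonomy string (hypothetical format).
    outgroup_prefixes: taxonomy prefixes that mark a hit as outgroup.
    """
    prefixes = tuple(outgroup_prefixes)
    kept, discarded = {}, {}
    for feature_id, taxonomy in classifications.items():
        if taxonomy.startswith(prefixes):
            discarded[feature_id] = taxonomy
        else:
            kept[feature_id] = taxonomy
    return kept, discarded

# Illustrative classifier output only.
classifications = {
    "feat1": "k__Fungi;p__Ascomycota",
    "feat2": "k__Viridiplantae;p__Streptophyta",
    "feat3": "k__Fungi;p__Basidiomycota",
}
kept, discarded = drop_outgroup_hits(classifications, ["k__Viridiplantae"])
```

Keeping the discarded hits around (rather than silently deleting them) makes it easier to document how much of the data landed in the outgroups.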

But ultimately you should thoroughly validate this using appropriate test data!
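As one rough way to check this, you could compare predicted vs. expected labels on test sequences of known origin and look at accuracy clade by clade. The sketch below assumes both sets of labels are dicts of sequence ID to semicolon-delimited taxonomy; `accuracy_by_clade` and the example labels are made up for illustration, and exact-match accuracy is just one possible metric.

```python
from collections import defaultdict

def accuracy_by_clade(expected, predicted, rank=1):
    """Fraction of exact-match classifications, grouped by the expected clade.

    expected, predicted: dicts of seq_id -> taxonomy string (hypothetical format).
    rank: index of the taxonomic level used for grouping (0 = first level).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for seq_id, truth in expected.items():
        levels = truth.split(";")
        clade = levels[min(rank, len(levels) - 1)]  # fall back to deepest level
        totals[clade] += 1
        if predicted.get(seq_id) == truth:
            correct[clade] += 1
    return {clade: correct[clade] / totals[clade] for clade in totals}

# Illustrative test data only.
expected = {
    "s1": "k__Fungi;p__Ascomycota",
    "s2": "k__Fungi;p__Ascomycota",
    "s3": "k__Fungi;p__Basidiomycota",
}
predicted = {
    "s1": "k__Fungi;p__Ascomycota",
    "s2": "k__Fungi;p__Basidiomycota",  # misclassified
    "s3": "k__Fungi;p__Basidiomycota",
}
scores = accuracy_by_clade(expected, predicted)
```

A per-clade breakdown like this helps show whether the well-covered clades really do perform as expected.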

Good luck!