Adding non-target DNA to Naive-Bayes classifier


If I were to add, say, 23,000 ‘non-target’ ITS2 sequences (Viridiplantae, Amoebozoa, etc.) to my UNITE fungal database (58,000 sequences), how might that influence the accuracy of my feature classifier?

I know that adding just a few non-target sequences is good and will decrease over-classification of non-targets, but I am wondering if adding so many will have the opposite effect on ‘target’ sequence classification. That is, if I train my classifier on too much non-target DNA, will that result in a greater likelihood of an incorrect deeper (Genus/Species level) classification of target DNA?

I am running it both ways to compare, but I was hoping someone else might have some insight on this as I am somewhat ignorant of the exact mechanics behind the classifier.




That is a really fascinating question @Lorinda.

I do not know the true answer (which probably depends heavily on the content of the reference and query sequences), or rather I do not know the magnitude of the effect.

It is probably not a major factor. For example, we have pre-trained classifiers from the SILVA database for full-length SSU (16S + 18S!) and for 16S V4. As far as I know, the full-length classifier (even with the 18S included) works reasonably well for classification of 16S sequences, though I have not really tested the accuracy of the full-length vs. V4-only versions.

But it could theoretically have some effect, and I do not really see a benefit unless you want to be able to characterize those non-target sequences (I assume that’s your plan).
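To make the "theoretical effect" a bit more concrete, here is a toy sketch of a k-mer naive Bayes classifier in plain Python. It is not the actual QIIME 2 implementation: the k-mer length, the uniform prior, the add-one smoothing, and all sequences and labels (`Fungus_A`, `Fungus_B`, `Plant_X`) are made up for illustration. It only shows that adding an extra class puts another term in the normalisation, so the posterior of the existing labels can only go down, and by how much depends on how many k-mers the new class shares with the query.

```python
import math
from collections import Counter

K = 3  # toy k-mer length; an assumption chosen to keep the example small


def kmers(seq, k=K):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(refs, k=K):
    """Count k-mers per label. `refs` maps label -> list of reference sequences."""
    return {label: Counter(km for s in seqs for km in kmers(s, k))
            for label, seqs in refs.items()}


def posteriors(model, query, k=K, alpha=1.0):
    """Posterior per label, assuming a uniform prior and add-alpha smoothing."""
    vocab_size = 4 ** k  # every possible DNA k-mer
    log_like = {}
    for label, counts in model.items():
        total = sum(counts.values()) + alpha * vocab_size
        log_like[label] = sum(math.log((counts[km] + alpha) / total)
                              for km in kmers(query, k))
    # Normalise with log-sum-exp for numerical stability.
    top = max(log_like.values())
    norm = sum(math.exp(v - top) for v in log_like.values())
    return {label: math.exp(v - top) / norm for label, v in log_like.items()}


# Hypothetical reference sequences, invented for this example.
target = {"Fungus_A": ["ACGTACGTTAGCACGT"], "Fungus_B": ["ACGTACGATAGCTCGT"]}
non_target = {"Plant_X": ["TTGGCCAATTGGCCAA"]}
query = "ACGTACGTTAGC"

before = posteriors(train(target), query)
after = posteriors(train({**target, **non_target}), query)
print("target only:    ", before)
print("with non-target:", after)
```

In this toy the plant-like reference shares no k-mers with the query, so the fungal posteriors barely move. A non-target reference that shared many k-mers with the query would pull them down further, which is the case to watch for in a real comparison.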

Run both ways and let us know what you see! In this scenario it is probably safe to say that shallower classifications are a bad thing even if you do not know the true composition of your samples…
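If it helps with the comparison, here is a minimal sketch for tallying how classification depth changes between the two runs. It assumes you have already loaded each run's results into a dict mapping feature ID to a semicolon-separated taxonomy string with rank prefixes like `k__`, `p__`; the function names and the example data are hypothetical.

```python
from collections import Counter


def depth(taxonomy):
    """Count the leading ranks that carry a name ('g__Russula' counts, 'g__' does not)."""
    n = 0
    for rank in taxonomy.split(";"):
        name = rank.strip().split("__", 1)[-1]
        if not name or name.lower() in ("unidentified", "unassigned"):
            break
        n += 1
    return n


def compare_depths(run_a, run_b):
    """Tally how run_b compares with run_a for the feature IDs present in both."""
    tally = Counter()
    for feature_id in run_a.keys() & run_b.keys():
        tax_a, tax_b = run_a[feature_id], run_b[feature_id]
        d_a, d_b = depth(tax_a), depth(tax_b)
        if d_b > d_a:
            tally["deeper"] += 1
        elif d_b < d_a:
            tally["shallower"] += 1
        elif tax_a.split(";")[:d_a] == tax_b.split(";")[:d_b]:
            tally["identical"] += 1
        else:
            tally["same_depth_different_label"] += 1
    return tally


# Hypothetical results, invented for this example.
fungi_only = {
    "f1": "k__Fungi;p__Ascomycota;c__Sordariomycetes",
    "f2": "k__Fungi;p__Basidiomycota",
    "f3": "k__Fungi",
}
with_non_target = {
    "f1": "k__Fungi;p__Ascomycota;c__Sordariomycetes",
    "f2": "k__Fungi",
    "f3": "k__Viridiplantae;p__Streptophyta",
}

print(compare_depths(fungi_only, with_non_target))
```

A large "shallower" count among features that the fungi-only classifier called fungal would be the warning sign here, while "deeper" calls that switch to a non-target kingdom are probably the non-target sequences being picked up as intended.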
