If I were to add, say, 23,000 ‘non-target’ ITS2 sequences (Viridiplantae, Amoebozoa, etc.) to my UNITE fungal database (58,000 sequences), how might that influence the accuracy of my feature classifier?
I know that adding just a few non-target sequences is good and will decrease over-classification of non-targets, but I am wondering if adding so many will have the opposite effect on ‘target’ sequence classification. That is, if I train my classifier on too much non-target DNA, will that result in a greater likelihood of an incorrect deeper (genus/species level) classification of target DNA?
I am running it both ways to compare, but I was hoping someone else might have some insight on this as I am somewhat ignorant of the exact mechanics behind the classifier.
That is a really fascinating question @Lorinda.
I do not know the true answer (which probably depends heavily on the content of the reference and query sequences), or rather I do not know the magnitude of the effect.
It is probably not a major factor. For example, we have pre-trained 16S classifiers from the SILVA database for full-length SSU (16S + 18S!) and for 16S V4. As far as I know, the full-length classifier (even with the 18S included) works reasonably well for classification of 16S sequences, though I have not really tested the accuracy of the full-length vs. V4-only versions.
But it could theoretically have some effect, and I do not really see a benefit unless you want to be able to characterize those non-target sequences (I assume that’s your plan).
Run both ways and let us know what you see! In this scenario it is probably safe to say that shallower classifications are a bad thing even if you do not know the true composition of your samples…
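For the side-by-side comparison, one rough way to quantify "shallower" is to count how many ranks each feature was assigned under each classifier. The sketch below is only an illustration, not part of any QIIME 2 API: it assumes each run's taxonomy has already been loaded into a plain dict of feature ID to a semicolon-delimited, rank-prefixed string (e.g. `k__Fungi;p__Ascomycota`), and the function names, dict names, and feature IDs are all hypothetical.

```python
from collections import Counter


def classification_depth(taxon: str) -> int:
    """Count leading ranks that carry a real name (assumes 'k__Name;p__Name' style)."""
    depth = 0
    for rank in taxon.split(";"):
        # Drop the rank prefix ("k__", "p__", ...) if present.
        name = rank.strip().split("__", 1)[-1]
        # Stop at the first empty rank or an "Unassigned" label.
        if not name or name.lower() == "unassigned":
            break
        depth += 1
    return depth


def compare_depths(fungi_only: dict, with_nontarget: dict) -> Counter:
    """Tally whether the expanded-database run is shallower, deeper, or the same."""
    outcome = Counter()
    for feature_id, taxon_a in fungi_only.items():
        taxon_b = with_nontarget.get(feature_id)
        if taxon_b is None:
            continue  # feature missing from the second run; skip it
        depth_a = classification_depth(taxon_a)
        depth_b = classification_depth(taxon_b)
        if depth_b < depth_a:
            outcome["shallower"] += 1
        elif depth_b > depth_a:
            outcome["deeper"] += 1
        else:
            outcome["same"] += 1
    return outcome


# Hypothetical toy data standing in for the two classification runs.
fungi_only = {
    "feat1": "k__Fungi;p__Ascomycota;c__Sordariomycetes",
    "feat2": "k__Fungi;p__Basidiomycota",
    "feat3": "k__Fungi",
}
with_nontarget = {
    "feat1": "k__Fungi;p__Ascomycota",
    "feat2": "k__Fungi;p__Basidiomycota",
    "feat3": "k__Viridiplantae;p__Streptophyta",
}

print(compare_depths(fungi_only, with_nontarget))
```

Depth alone says nothing about correctness, so a tally like this is only a first look; features that change kingdom between runs (like `feat3` in the toy data) are probably worth inspecting by hand.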