different results from self-trained full-length and v4 classifiers

Nicholas_Bokulich · February 13, 2020, 6:08pm

Hi @Xiaolan_Lin,
As @colinbrislawn noted, differences are expected between classifiers trained on full-length 16S vs. variable regions. Your differences may be unexpected, but I see one major issue that should be addressed before evaluating them:

You have a very large number of unclassified reads. This appears to be due to the way you have structured the taxonomy annotations; see this topic for more details:

Once that's fixed you will be able to better evaluate how different these classifiers are performing.

Something else to evaluate is how many sequences are being lost during read extraction. If your primers have poor coverage, you could be losing many reference sequences in the process. Low coverage (lots of reference sequences lost during read extraction) will hurt classification results: it will make a low-quality classifier!
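A rough way to check this is to count the sequences before and after read extraction. Here is a minimal sketch, assuming you have both reference sets available as plain FASTA text. The function names and the toy sequences are made up for illustration and are not part of QIIME 2.

```python
import io


def count_fasta_records(handle):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in handle if line.startswith(">"))


def fraction_retained(full_handle, extracted_handle):
    """Return the fraction of reference sequences that survived read extraction."""
    n_full = count_fasta_records(full_handle)
    n_extracted = count_fasta_records(extracted_handle)
    if n_full == 0:
        raise ValueError("full-length reference contains no sequences")
    return n_extracted / n_full


# Toy data standing in for real files (hypothetical sequences).
full = io.StringIO(">a\nACGT\n>b\nACGT\n>c\nACGT\n>d\nACGT\n")
extracted = io.StringIO(">a\nCG\n>c\nCG\n>d\nCG\n")

retained = fraction_retained(full, extracted)
print(f"{retained:.0%} of reference sequences retained")
```

With real data you would pass open file handles instead of the `io.StringIO` objects. If the retained fraction is low, it is also worth checking which taxa were dropped, since losses concentrated in a few groups matter more than the overall number suggests.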

One issue to be aware of when making a custom database is that you want to have "outgroups" in your data. Otherwise the classifier could be prone to false-positive identification, because you are overfitting to a restricted set of possible taxa. SILVA may not be the best fit for your goal (e.g., misannotated and unannotated sequences are a known issue in this and most reference databases), but making a custom database is a more challenging task than first meets the eye! You are probably aware, but I just want to point this out for others following along.