Running sanity checks on custom QIIME 2 classifiers

Hi @gmdouglas,
Sounds like you were suffering from hotspring-metagenome-itis! As you discovered, abnormally short sequences used as input to fit-classifier-naive-bayes can produce a wonky classifier, which in turn leads to spurious misclassifications. :hotsprings:

This could arguably be considered a bug (e.g., maybe fit-classifier-naive-bayes should detect and weed out abnormally short sequences when the classifier is being trained?) but this really comes down to an age-old truth: junk in, junk out. :put_litter_in_its_place:

This is a rare issue, and it usually crops up when training on a custom reference database or with specific primer sets. extract-reads has the min-length parameter as a safeguard against it, but ultimately the onus is on every investigator to make sure that the reference data they use are of high quality.
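
If you want to check a reference for short sequences yourself before training, something along these lines would do it. This is only a rough sketch in plain Python, not what extract-reads does internally: the function names (`read_fasta`, `split_by_length`) are made up for illustration, and the length threshold is something you would need to pick for your own marker gene and primers.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence ID, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(
    records: Iterable[Tuple[str, str]], min_length: int
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Separate records into those at least min_length long and the rest."""
    kept: Dict[str, str] = {}
    dropped: Dict[str, str] = {}
    for seq_id, seq in records:
        target = kept if len(seq) >= min_length else dropped
        target[seq_id] = seq
    return kept, dropped


if __name__ == "__main__":
    # Toy input; in practice you would pass an open FASTA file handle.
    example = [">ref1", "ACGTACGTAC", ">ref2", "ACG", ">ref3", "ACGTACG"]
    kept, dropped = split_by_length(read_fasta(example), min_length=5)
    print("kept:", sorted(kept))
    print("dropped (too short):", sorted(dropped))
```

Looking at what ends up in `dropped` before you train is often more informative than the filtering itself, since it tells you which taxa are losing reference coverage.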

So thanks for this; a sanity check like this is useful for testing the performance of custom classifiers. The tools in q2-quality-control are also designed for this type of purpose. They were originally intended for use with samples of "known" composition (e.g., mock communities or simulated samples), but they can be used to check the consistency of taxonomic assignments in any samples. For example, you could compare classifications made with your custom database against those from a full-length 16S classifier (or the complete database).
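
To make that kind of consistency check concrete, here is a minimal sketch in plain Python. It is not the q2-quality-control implementation, just an illustration of the idea: it assumes you have already exported both sets of classifications into dictionaries mapping feature IDs to semicolon-delimited taxonomy strings, and the names (`truncate`, `compare_assignments`) are hypothetical.

```python
from typing import Dict, List, Tuple


def truncate(taxon: str, depth: int) -> Tuple[str, ...]:
    """Split a semicolon-delimited taxonomy string and keep the first `depth` ranks."""
    ranks = [rank.strip() for rank in taxon.split(";") if rank.strip()]
    return tuple(ranks[:depth])


def compare_assignments(
    custom: Dict[str, str], reference: Dict[str, str], depth: int
) -> Tuple[float, List[str]]:
    """Return the fraction of shared features whose assignments agree down to
    `depth` ranks, plus the IDs of the features that disagree."""
    shared = sorted(set(custom) & set(reference))
    mismatches = [
        fid
        for fid in shared
        if truncate(custom[fid], depth) != truncate(reference[fid], depth)
    ]
    if not shared:
        # No features in common, so agreement is undefined.
        return float("nan"), mismatches
    return 1 - len(mismatches) / len(shared), mismatches


if __name__ == "__main__":
    # Toy assignments, invented for illustration only.
    custom = {
        "feat1": "k__Bacteria; p__Firmicutes; c__Bacilli",
        "feat2": "k__Bacteria; p__Proteobacteria",
    }
    full_length = {
        "feat1": "k__Bacteria; p__Firmicutes; c__Clostridia",
        "feat2": "k__Bacteria; p__Proteobacteria",
    }
    agreement, mismatches = compare_assignments(custom, full_length, depth=3)
    print(f"agreement at depth 3: {agreement:.2f}; disagreeing features: {mismatches}")
```

Features that disagree at shallow ranks (e.g., phylum) are the ones worth a closer look, since that is where a classifier trained on junk reference sequences tends to give itself away.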
