Running sanity checks on custom QIIME 2 classifiers

I’m posting a simple pipeline I developed to quickly validate the assignments made by a new QIIME 2 taxonomic classifier (described at

I wanted a quick way to run sanity checks on QIIME 2 taxonomic assignments because I recently ran into a major issue with a custom taxonomic classifier I had created with qiime2-2019.7 (see original issue). Essentially all ASVs were being classified as the genus Alteromonas, even though their 16S sequences were extremely different from one another and matched different phyla by BLAST.

It turns out that using these more restrictive options for qiime feature-classifier extract-reads fixed the spurious Alteromonas assignments (a problem that was also discussed in a similar thread):

  --p-identity 0.9 \
  --p-min-length 300 \
  --p-max-length 500 \

However, I wanted to make sure there weren’t other, subtler issues specific to rarer lineages (e.g., if similar misclassifications were affecting only a subset of ASVs, they would be much harder to notice). The simple pipeline I came up with orders all ASVs by the number of taxonomic label mismatches between the QIIME 2 assignment and an independent assignment approach, so the most discordant ASVs rise to the top. This makes it quick to check that there aren’t any major misclassification errors (there weren’t in my case once I used the new qiime feature-classifier extract-reads options!), and I think others might find it useful too.
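
A minimal sketch of the mismatch-ranking idea is below. It is not the exact pipeline from this post. It assumes both assignments have been exported as tab-separated tables with a header row, the feature ID in column 1 and a semicolon-delimited taxon string in column 2 (the layout of a taxonomy.tsv exported from a QIIME 2 FeatureData[Taxonomy] artifact), and that both use the same rank nomenclature (e.g., both Greengenes-style k__/p__ labels). The function name rank_mismatches and the file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: rank ASVs by how many taxonomic ranks disagree between
# a QIIME 2 assignment and an independent one (e.g., parsed top BLAST hits).
# Assumed input layout for BOTH files: tab-separated, one header row,
# feature ID in column 1, ";"-delimited taxon string in column 2.
# Usage (hypothetical file names):
#   rank_mismatches taxonomy.tsv blast_taxonomy.tsv | head -n 20
rank_mismatches() {
  qiime_tsv=$1
  other_tsv=$2
  tab=$(printf '\t')
  awk -F '\t' '
    FNR == 1 { next }                    # skip the header row of each file
    NR == FNR { other[$1] = $2; next }   # first file: independent labels
    ($1 in other) {                      # second file: QIIME 2 labels
      n1 = split($2, q, /; */)
      n2 = split(other[$1], o, /; */)
      n = (n1 > n2) ? n1 : n2            # a rank missing on one side is a mismatch
      mm = 0
      for (i = 1; i <= n; i++) if (q[i] != o[i]) mm++
      print $1 "\t" mm "\t" $2 "\t" other[$1]
    }' "$other_tsv" "$qiime_tsv" |
    sort -t "$tab" -k2,2nr -k1,1         # most mismatches first, ties by ID
}
```

ASVs missing from either table are silently skipped, and a rank present in only one of the two labels (e.g., a truncated low-confidence assignment) counts as a mismatch, so shallow assignments will also float towards the top of the list.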


It might also be useful for others to know that the original error I referred to (calling all ASVs as Alteromonas) didn't occur on every test dataset, so it depended to some degree on the input data. The dataset I was using for testing contained only ~50 ASVs (dataset3 below), and the error only emerged when I used datasets with more typical diversity. I think this means users should test their custom classifiers on a wide range of input ASVs.
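
Since the failure mode here was nearly every ASV receiving the same label, one cheap check to run on each test dataset is the share of ASVs carrying the single most common taxon string. A minimal sketch, assuming an exported taxonomy.tsv (tab-separated, header row, taxon string in column 2); the function names and the 0.5 default threshold are arbitrary, hypothetical choices, not anything built into QIIME 2.

```shell
#!/bin/sh
# Hypothetical check for the "nearly every ASV gets one label" failure mode.
# Input: an exported taxonomy.tsv (tab-separated, header row, taxon in col 2).
# Prints: top taxon string, its ASV count, total ASVs, and its share.
top_label_share() {
  awk -F '\t' '
    BEGIN { best = 0 }
    FNR == 1 { next }                    # skip the header row
    { count[$2]++; total++ }
    END {
      if (total == 0) exit 1             # no ASVs to summarise
      for (t in count) if (count[t] > best) { best = count[t]; top = t }
      printf "%s\t%d\t%d\t%.3f\n", top, best, total, best / total
    }' "$1"
}

# Returns 2 and prints a warning on stderr when the top label covers more than
# the given fraction of ASVs; the 0.5 default is an arbitrary assumption.
check_dominance() {
  line=$(top_label_share "$1") || return 1
  share=$(printf '%s\n' "$line" | cut -f4)
  if awk -v s="$share" -v m="${2:-0.5}" 'BEGIN { exit !(s > m) }'; then
    printf 'WARNING: one label dominates: %s\n' "$line" >&2
    return 2
  fi
}
```

If several labels tie for the top count, which one is reported is arbitrary. A high share is not proof of a broken classifier (a low-diversity sample can legitimately be dominated by one genus), so it is better treated as a prompt for a manual BLAST check than as a verdict.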

Also, for future reference, this error was resolved not only by using different options for extract-reads but also by using non-default options when training the classifier itself (the alpha and ngram options of qiime feature-classifier fit-classifier-naive-bayes). The breakdown of some tests I ran is shown below: red indicates that almost all ASVs were called as Alteromonas, and blue indicates that the assignments appeared correct based on manual BLAST.

I played around with these settings because their defaults had changed since qiime2-2018-2.0 (when I first created similar classifiers). Interestingly, using the older option of ngram=8 fixed the classifications for dataset2, but not for dataset1. Again, I think this indicates that sanity checks need to be run on a range of ASVs when validating custom classifiers. Unfortunately, I can't make these 3 test datasets available.

Hi @gmdouglas,
Sounds like you were suffering from hotspring-metagenome-itis. As you discovered, abnormally short sequences used as input to fit-classifier-naive-bayes can result in a wonky classifier, which leads to spurious misclassifications. :hotsprings:

This could arguably be considered a bug (e.g., maybe fit-classifier-naive-bayes should detect and weed out abnormally short sequences when the classifier is being trained?), but this really comes down to an age-old truth: junk in, junk out. :put_litter_in_its_place:

It is a rare issue; we usually see it crop up when training on a custom reference database or with specific primer sets. extract-reads has the min-length parameter as a safeguard against this, but ultimately the onus is on each investigator to make sure that the reference data they use are of high quality.
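
For custom reference data, a quick way to spot abnormally short sequences before training is to list them directly. The sketch below assumes the reference reads have been exported to FASTA (for example, with qiime tools export on the extract-reads output); short_seqs is a hypothetical helper, and its 300 bp default simply mirrors the --p-min-length value mentioned earlier in this thread, not a general recommendation.

```shell
#!/bin/sh
# Hypothetical pre-training check: list reference sequences shorter than a
# minimum length in a FASTA file (e.g., exported extract-reads output).
# Handles sequences wrapped over several lines. Prints: ID <tab> length.
short_seqs() {
  fasta=$1
  min_len=${2:-300}    # 300 mirrors the --p-min-length value used earlier
  awk -v min="$min_len" '
    function report() { if (id != "" && len < min) print id "\t" len }
    /^>/ { report(); id = substr($1, 2); len = 0; next }   # new header line
    { gsub(/[ \t\r]/, ""); len += length($0) }             # sequence line
    END { report() }' "$fasta"
}
```

Anything it prints is a candidate for removal (or a sign that the extract-reads length limits need tightening) before running fit-classifier-naive-bayes.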

So thanks for this; having a sanity check like this is useful for testing the performance of custom classifiers. The tools in q2-quality-control are also designed for this type of purpose, and while they were originally intended for use with samples of “known” composition (e.g., mock communities or simulated samples), they can be used to check the consistency of taxonomic assignments in any samples, e.g., by comparing classifications made with your custom database against those from a full-length 16S classifier (or complete database).
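
As a rough illustration of that kind of consistency check (this is a plain awk summary, not one of the q2-quality-control actions), per-rank agreement between a custom-database classification and a full-length classifier can be tabulated as below. It assumes both tables are exported taxonomy.tsv files (tab-separated, header row, feature ID in column 1, semicolon-delimited taxon string in column 2) that use the same rank nomenclature; rank_agreement is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical per-rank agreement summary (not a q2-quality-control action).
# Args: custom-database taxonomy, full-length/reference taxonomy, number of
# ranks to report (default 7). Both files: tab-separated, header row,
# feature ID in column 1, ";"-delimited taxon string in column 2.
# Prints: rank position, agreeing ASVs, shared ASVs, agreement fraction.
rank_agreement() {
  awk -F '\t' -v ranks="${3:-7}" '
    FNR == 1 { next }                    # skip the header row of each file
    NR == FNR { ref[$1] = $2; next }     # first file: reference labels
    ($1 in ref) {                        # second file: custom labels
      shared++
      split($2, a, /; */)
      split(ref[$1], b, /; */)
      for (i = 1; i <= ranks; i++) if (a[i] != "" && a[i] == b[i]) agree[i]++
    }
    END {
      for (i = 1; i <= ranks; i++)
        printf "rank%d\t%d\t%d\t%.3f\n", i, agree[i], shared, (shared ? agree[i] / shared : 0)
    }' "$2" "$1"
}
```

A rank missing from the custom label never counts as agreement, while identical placeholder labels such as g__ do, so a sharp drop at one rank is the thing to look for rather than the absolute numbers.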