Evaluate custom taxonomic classifier with RESCRIPt?

Greetings Q2 and RESCRIPt community, :globe_showing_europe_africa:

I am using qiime2-amplicon-2025.7, and am new to qiime2.

I am analyzing ITS1 amplicon sequencing data from Illumina MiSeq runs. I am benchmarking my pipeline with mock ITS datasets, using tutorials for DADA2 with the UNITE database.

I followed Q2 forum recommendations to filter the UNITE database to remove unclassifieds, which meant setting aside the nice pre-trained UNITE classifiers for now. I used RESCRIPt to filter out unclassified sequences and to perform other recommended curation steps.

I wanted to use the settings for naive Bayes classifiers with UNITE recommended in Bokulich et al. 2018 (Microbiome), which specified k-mers of [6,6]. So, I trained my classifier as follows.

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-taxonomy taxonomy_UNITEdb_noSH_derep_20251118.qza \
  --i-reference-reads sequences_filtered_derep_culled_UNITEdb_20251118.qza \
  --o-classifier classifier_UNITEdb_20251122.qza \
  --p-feat-ext--ngram-range '[6, 6]'

This appeared to work great. So, now I have custom UNITE seq and taxonomy db, and a corresponding custom classifier trained on those dbs. :tada:

I would have preferred to use the provided combined training-and-evaluation tool: qiime rescript evaluate-fit-classifier.
But I did not see a way to specify the ngram range in this RESCRIPt wrapper of q2-feature-classifier, and it looks like the default is [7,7].

Next comes evaluation.
From the RESCRIPt manuscript, my understanding is that the primary purpose of RESCRIPt is database curation, specifically sequence databases and taxonomy databases. Since it wraps q2-feature-classifier in order to evaluate databases, it will also export the resulting classifier.

But when I read forum articles such as "Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt", it seems like RESCRIPt also supports evaluating a trained classifier, which I would like to do.
Specifically, quotes like this:
"The command below [… is …] to evaluate classification accuracy of our GreenGenes classifier."

Since I am not using default k-mer settings to train my classifier, I assume that I cannot use rescript evaluate-fit-classifier (I didn't see a param option). So, my remaining option for evaluation is evaluate-cross-validate. My understanding is that, in order to kick the tires on my filtered UNITE sequence and taxonomy databases, this step creates a second, 'in-house' classifier ("The database is split into K folds for testing; classifiers are trained K times on the remaining sequences; and the test set is classified using the corresponding classifier."). So, although the purpose of this step is to evaluate the sequence and taxonomy databases, it creates 'in-house' classifiers (one per fold) to do so.
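
To check my understanding of that description, here is a toy sketch of the K-fold scheme in pure Python. Everything here is made up for illustration: the sequences, the K=3 split, and especially the stand-in "classifier" (a shared-k-mer nearest match), since the real q2-feature-classifier trains a multinomial naive Bayes model over k-mer features.

```python
from collections import Counter

def kmers(seq, k=6):
    """Count the overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Toy reference "database": sequences with taxonomy labels
ref = [("ACGTACGTACGT", "Fungi;Asco"),
       ("ACGTACGAACGT", "Fungi;Asco"),
       ("ACGAACGTACGT", "Fungi;Asco"),
       ("TTGCATTGCATT", "Fungi;Basidio"),
       ("TTGCATAGCATT", "Fungi;Basidio"),
       ("TTGAATTGCATT", "Fungi;Basidio")]

def classify(query, training):
    # Stand-in classifier: take the label of the training sequence
    # sharing the most 6-mers with the query (NOT what the real
    # tool does -- it fits naive Bayes on k-mer counts).
    q = kmers(query)
    best = max(training, key=lambda rec: sum((q & kmers(rec[0])).values()))
    return best[1]

K = 3
predictions = []
for fold in range(K):
    # Hold out every K-th record as this fold's test set...
    test = ref[fold::K]
    # ...and "train" a fresh in-house classifier on the rest
    training = [rec for i, rec in enumerate(ref) if i % K != fold]
    for seq, truth in test:
        predictions.append((truth, classify(seq, training)))

# Every reference sequence now has a prediction from a classifier
# that never saw it during training
correct = sum(t == p for t, p in predictions)
print(f"{correct}/{len(predictions)} correct under cross-validation")
```

The point of the sketch is just the structure: each held-out sequence is classified by a model trained without it, so the aggregate score estimates performance on sequences absent from the training set.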

The outputs of this step can then be used to evaluate classification accuracy with evaluate-classifications, but my understanding is that this evaluates the 'in-house' classifiers, not the custom UNITE classifier that I trained previously.

My questions are:

  1. Any errors in understanding above?
  2. Is there a way to set ngram to [6,6] for the combined qiime rescript evaluate-fit-classifier ?
  3. Is there a way to evaluate my custom UNITE classifier with RESCRIPt, i.e., something with the effect of a parameter to specify my own classifier file in evaluate-cross-validate? Or should my understanding be that mock ITS libraries are used to evaluate classifiers in downstream steps, by comparing known taxonomy to classifications?
  4. Is the primary purpose of rescript evaluate-cross-validate and rescript evaluate-classifications to evaluate the classifier, or to evaluate the DNA sequence and taxonomy databases? I can see from the GreenGenes tutorial link above that the .qzv output for evaluate-classifications is a plot of F-measure for a given dataset; F-measure is the statistic used to evaluate an ML classifier. But I'm not sure whether a high F-measure in this visualization means that the in-house classifier is good, or the databases are good, or both.

Thank you for any information or guidance! :folded_hands:


Hi @sibilant ,

Thanks for using RESCRIPt! Just to warn you, evaluate-cross-validate can be very slow. The idea is to use cross-validation as a fairly simple way to simulate classifier performance on unseen sequences (assuming that you do not have replicate sequences in the database). By contrast, evaluate-fit-classifier trains a classifier and then classifies the same input sequences, so it evaluates performance in the best-case scenario, where every query sequence has an exact match among the training sequences.

Both of these are simply QIIME 2 pipelines that run a series of other actions under the hood, so you can replicate these pipelines. evaluate-cross-validate would be quite complicated to replicate, but evaluate-fit-classifier is quite simple: to replicate it, you can just take your custom trained classifier and then classify the same input sequences to get an estimate of performance.

Not unless you want to modify the underlying code. This parameter is not exposed, unfortunately.

This is a very insightful question :slight_smile: it really is a bit of both, or rather the same thing. Assuming that you have a suitable model (and in general we can say that the naive Bayes classifier performs quite well for both ITS and 16S; it has been tested under many conditions at this point and is fairly robust), the classifier performance is going to give you an estimate of the taxonomic resolution of the database. At the very least, you can definitely test relative performance of classifier parameters (by using the same database but altering parameters), or of database variations (by using the same classifier parameters but altering the database, e.g., via filtering or clustering or whatever).
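
To make the relative-comparison idea concrete, here is a toy sketch of the per-rank precision/recall/F-measure arithmetic behind that plot. The taxonomy strings are invented, and this simplified exact-match scoring ignores the weighting and unassigned-rank handling that evaluate-classifications actually performs.

```python
# Compare expected vs. observed taxonomies rank by rank and compute
# precision, recall, and F-measure at each taxonomic depth.
expected = ["k__Fungi;p__Asco;g__Alternaria",
            "k__Fungi;p__Asco;g__Fusarium",
            "k__Fungi;p__Basidio;g__Malassezia"]
observed = ["k__Fungi;p__Asco;g__Alternaria",
            "k__Fungi;p__Asco;g__Alternaria",   # misclassified at genus
            "k__Fungi;p__Basidio"]              # unclassified at genus

def per_rank_f_measure(expected, observed, n_ranks=3):
    stats = {}
    for depth in range(1, n_ranks + 1):
        tp = fp = fn = 0
        for exp, obs in zip(expected, observed):
            exp_ranks = exp.split(";")[:depth]
            obs_ranks = obs.split(";")[:depth]
            if len(obs_ranks) < depth:
                fn += 1                    # no prediction at this depth
            elif obs_ranks == exp_ranks:
                tp += 1                    # correct down to this depth
            else:
                fp += 1                    # wrong prediction at this depth
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
        stats[depth] = (precision, recall, f)
    return stats

stats = per_rank_f_measure(expected, observed)
print(stats[3])  # genus level → (0.5, 0.5, 0.5)
```

Note how a single set of scores reflects the classifier and the database together: hold one fixed while varying the other, and differences in F-measure become attributable.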

I hope that helps!


Thank you for the helpful reply!

I got the outputs of evaluate-cross-validate, evaluate-classifications, and evaluate-taxonomy, and so far everything looks good to me.

evaluate-fit-classifier is quite simple: to replicate it, you can just take your custom trained classifier and then classify the same input sequences to get an estimate of performance.

I'd like to try this. My understanding is that the first step would be something like:

qiime feature-classifier classify-sklearn \
  --i-classifier my-custom-UNITEdb-classifier.qza \
  --i-reads my-custom-UNITEdb-seqs.qza \
  --o-classification custom-UNITEdb-taxonomy-classifications.qza

Then I would compare the custom-UNITEdb-taxonomy-classifications.qza output to the curated reference taxonomy used to generate the custom classifier, such as

qiime rescript evaluate-classifications \
  --i-expected-taxonomies my-curated-ref-UNITEdb-taxonomy.qza \
  --i-observed-taxonomies custom-UNITEdb-taxonomy-classifications.qza \
  --p-labels customUNITEdb \
  --o-evaluation custom-UNITE-classifier-evaluation.qzv

And then

qiime rescript evaluate-taxonomy \
  --i-taxonomies my-curated-ref-UNITEdb-taxonomy.qza custom-UNITEdb-taxonomy-classifications.qza \
  --p-labels ref-taxonomy predicted-taxonomy \
  --o-taxonomy-stats ref-vs-predicted-curatedUNITE-taxonomy.qzv

Is that a correct implementation of what you suggested above?

My other question is about interpreting these two sets of outputs (one from the 'in-house' classifiers on my curated UNITEdb, described in the OP, and one described here, analyzing my custom classifier with the same curated UNITEdb).

Since I'm using the same curated UNITEdb with two different classifiers ('in-house' and custom), any difference in F-measure is due to the different classifiers.

However, although the same UNITE db was used for both classifiers, it's not a level playing field: each 'in-house' classifier was trained on only a fraction of the db, so every classification was of a read absent from that training fraction ('novel' reads).

Whereas in the analysis with my custom classifier, the entire db was used for one classification analysis, and all reads that were classified were found in the db (the 'best-case scenario' for predicting classification, as you mentioned above).

Would it be reasonable to say that these two analyses are sort of boundary conditions of best- and worst-case scenarios, and that therefore the more representative F-measure of my curated UNITE db for classifying real environmental samples lies somewhere in the middle, on average?

Thank you again for making and sharing these tools, and helping us use them! :mushroom: :brown_mushroom:

I have my eye on the tax-credit Python notebooks to level up my use of mock libraries in evaluating my ITS analysis pipeline; I'm excited.


Yes

Yes. evaluate-fit-classifier (or replicating this simple pipeline with your custom database) is clearly the best case, where the exact answer is present in the training data. evaluate-cross-validate is not exactly the worst case; things can get much worse than that (e.g., how will a classifier perform when it encounters a totally novel clade? More complicated simulations can cover this case, but simple cross-validation does not). I would say that cross-validation covers the more realistic case where an exact match is not found in the database (you would need to use rescript dereplicate to make sure this is the case), but some other similar match most likely is.
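
On the dereplicate point: replicate sequences shared between folds would inflate cross-validation scores, since an "unseen" test sequence would have an exact copy in the training set. Here is a toy sketch of what dereplication accomplishes (invented records; conflicting taxonomies are resolved here by least common ancestor, one of several modes rescript dereplicate offers):

```python
from collections import defaultdict

def lca(taxonomies):
    """Least common ancestor of semicolon-delimited taxonomy strings."""
    shared = []
    for ranks in zip(*(t.split(";") for t in taxonomies)):
        if len(set(ranks)) != 1:
            break  # ranks disagree from here down
        shared.append(ranks[0])
    return ";".join(shared)

def dereplicate(records):
    """Collapse identical sequences; resolve taxonomy conflicts by LCA."""
    by_seq = defaultdict(list)
    for seq, taxon in records:
        by_seq[seq].append(taxon)
    return {seq: lca(taxa) for seq, taxa in by_seq.items()}

records = [
    ("ACGT", "k__Fungi;p__Asco;g__Fusarium"),
    ("ACGT", "k__Fungi;p__Asco;g__Alternaria"),  # same seq, different genus
    ("TTGC", "k__Fungi;p__Basidio;g__Malassezia"),
]
derep = dereplicate(records)
print(derep["ACGT"])  # → k__Fungi;p__Asco
```

After this collapse, no sequence can appear in both a training and a test fold, so cross-validation scores are not inflated by exact duplicates.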

Those notebooks are so old, I would recommend sticking with RESCRIPt :sweat_smile: where we brought these ideas much further. Though for sure tax-credit could still be dusted off if you want to run a more rigorous benchmark (e.g., the truly worst case that I hinted at above is something covered in tax-credit but not RESCRIPt).

I hope that helps!


Thank you again for the reply!

I think I get what you mean about the tax-credit notebooks above, but want to check to be sure.

My understanding of your recommendation to use RESCRIPt instead is that RESCRIPt, especially the evaluate-cross-validate option, simulates classification of many mock libraries created from our taxonomy database in a high-throughput manner, whereas the tax-credit notebooks evaluate individual mock libraries input manually by the user. So, in terms of evaluating classifiers, your RESCRIPt rec makes sense to me.

I was thinking to use the tax-credit notebooks with mock libraries to evaluate my whole pipeline up to that point, including effects of DADA2 settings and Illumina read quality filtering and trimming decisions, not just the classifier.

I had been using Excel to do side-by-side comparisons of ref-vs-predicted taxonomies for small mock ITS libraries, which left much to be desired.

So, I was going to switch to the tax-credit notebooks for this.

But, after your comment, I did some more digging around the forum, and it seems like qiime quality-control evaluate-taxonomy and/or qiime quality-control evaluate-composition is the likely tool people use to see how well their pipeline predicts the known mock library.
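
If I understand evaluate-composition correctly, the core of the comparison is expected vs. observed relative abundances per taxon, something like this toy sketch (made-up abundances; the actual action also reports correlations, per-level breakdowns, and more):

```python
# Expected vs. observed relative abundances for a mock community,
# collapsed to some taxonomic level. Values sum to 1 in each profile.
expected = {"g__Alternaria": 0.5, "g__Fusarium": 0.3, "g__Malassezia": 0.2}
observed = {"g__Alternaria": 0.45, "g__Fusarium": 0.35, "g__Candida": 0.2}

taxa = set(expected) | set(observed)
# Bray-Curtis dissimilarity between the two profiles (0 = identical)
bray_curtis = 0.5 * sum(abs(expected.get(t, 0.0) - observed.get(t, 0.0))
                        for t in taxa)
# Taxa detected but not expected (false positives) and vice versa
false_pos = sorted(set(observed) - set(expected))
false_neg = sorted(set(expected) - set(observed))
print(round(bray_curtis, 3), false_pos, false_neg)
# → 0.25 ['g__Candida'] ['g__Malassezia']
```

Unlike the classifier evaluations above, this scores the whole pipeline end to end, since trimming, denoising, and classification all shape the observed composition.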

I'll plan on doing that (RESCRIPt to build/test curated UNITEdbs and classifier; qiime quality-control to compare my results to mock library ref taxonomy). I welcome any additional advice or recommendations. :seedling:

I also wanted to share my two cents with the wider Q2 forum: the tax-credit notebooks are valuable for inexperienced users like me to appreciate the more granular details of classifier testing that are 'under the hood' of RESCRIPt, both to get a feel for how the classifiers, parameters, and statistics interact, and as a great option for specific breakout analyses, like you mentioned above.

Thanks again so much for the replies! How fortunate to hear from the very same Dr. Bokulich of the RESCRIPt and q2-feature-classifier publications. :wrapped_gift:

Happy Thanksgiving! :pie::turkey:


Hi @sibilant ,
Yes exactly, the tax-credit notebooks are more customizable for running a thorough benchmark for taxonomy classification. The evaluation actions in RESCRIPt and q2-quality-control allow higher throughput but not quite as much control (particularly for the classifier training options).

Very glad to hear that you are finding both of these useful for your experiments!
