Greetings Q2 and RESCRIPt community,
I am using qiime2-amplicon-2025.7, and am new to qiime2.
I am analyzing ITS1 amplicon data from Illumina MiSeq runs, and am benchmarking my pipeline with mock ITS datasets, following the DADA2 tutorials and using the UNITE database.
I followed Q2 forum recommendations to filter the UNITE database to remove unclassified sequences, which means setting aside the nice pre-trained UNITE classifiers for now. I used RESCRIPt to filter out the unclassified sequences and to perform the other recommended curation steps.
I wanted to use the settings for naive Bayes classifiers with UNITE recommended in Bokulich et al. 2018 (*Microbiome*), which specified a k-mer range of [6,6]. So, I trained my classifier as follows:
```
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-taxonomy taxonomy_UNITEdb_noSH_derep_20251118.qza \
  --i-reference-reads sequences_filtered_derep_culled_UNITEdb_20251118.qza \
  --p-feat-ext--ngram-range '[6, 6]' \
  --o-classifier classifier_UNITEdb_20251122.qza
```
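To double-check my understanding of the `[6, 6]` setting: I read it as telling the feature extractor to build character n-grams of length exactly 6 (i.e., 6-mers) from each reference sequence. A minimal sketch of that idea in plain Python (just my mental model; the actual extractor in q2-feature-classifier hashes the k-mers rather than enumerating them):

```python
# Sketch: ngram-range [6, 6] means features are overlapping character
# 6-mers drawn from each sequence. This enumerates them explicitly;
# the real classifier uses a hashing-based extractor instead.

def kmers(seq, k=6):
    """Return the list of overlapping k-mers in a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = "ACGTACGTAC"
print(kmers(seq))       # ['ACGTAC', 'CGTACG', 'GTACGT', 'TACGTA', 'ACGTAC']
print(len(kmers(seq)))  # 10 - 6 + 1 = 5
```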
This appeared to work great. So, now I have a custom UNITE sequence db and taxonomy db, and a corresponding custom classifier trained on those dbs.
I would have preferred to use the combined training-and-evaluation tool provided: `qiime rescript evaluate-fit-classifier`.
But I did not see a way to specify the ngram range in this RESCRIPt wrapper of q2-feature-classifier, and it looks like the default is [7,7].
Next comes evaluation.
From the RESCRIPt manuscript, my understanding is that the primary purpose of RESCRIPt is database curation, specifically sequence databases and taxonomy databases. Since it wraps q2-feature-classifier in order to evaluate databases, it will also export the resulting classifier.
But, when I read forum articles such as "Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt", it seems like RESCRIPt also supports evaluating a trained classifier, which I would like to do.
Specifically, quotes like this:

> "The command below [… is …] to evaluate classification accuracy of our GreenGenes classifier."
Since I am not using the default k-mer settings to train my classifier, I assume that I cannot use `rescript evaluate-fit-classifier` (I didn't see a param option). So, my remaining option for evaluation is `evaluate-cross-validate`. My understanding is that, in order to kick the tires on my filtered UNITE sequence and taxonomy databases, this step creates a second, 'in-house' classifier ("The database is split into K folds for testing; classifiers are trained K times on the remaining sequences; and the test set is classified using the corresponding classifier."). So, although the purpose of this step is to evaluate the sequence and taxonomy databases, it is creating an 'in-house' classifier to do so.
The outputs of this step can then be used to evaluate classifier accuracy with `evaluate-classifications`, but my understanding is that this would evaluate the in-house classifier, not the custom UNITE classifier that I trained previously.
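For my own sanity, here is a sketch of the K-fold scheme as I understand the quote above (toy records standing in for reference sequences; not RESCRIPt's actual implementation):

```python
# Sketch of K-fold cross-validation as described in the RESCRIPt docs:
# the reference is split into K folds; for each fold, a fresh classifier
# is trained on the other K-1 folds and then classifies the held-out fold.

def k_fold_splits(records, k=3):
    """Yield (train, test) partitions for K-fold cross-validation."""
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

records = list(range(9))  # stand-ins for reference sequences
for train, test in k_fold_splits(records, k=3):
    assert set(train) | set(test) == set(records)  # every record is used
    assert set(train) & set(test) == set()         # no test leakage into training
```

So each held-out record is only ever classified by a classifier that never saw it during training, which is why K temporary classifiers are needed.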
My questions are:
- Any errors in my understanding above?
- Is there a way to set the ngram range to [6,6] for the combined `qiime rescript evaluate-fit-classifier`?
- Is there a way to evaluate my custom UNITE classifier with RESCRIPt, i.e., accomplishing the effect of a param to specify my own classifier file in `evaluate-cross-validate`? Or should my understanding be that mock ITS libraries are used to evaluate classifiers in downstream steps, by comparing known taxonomy to classification?
- Is the primary purpose of `rescript evaluate-cross-validate` and `rescript evaluate-classifications` to evaluate the classifier, or to evaluate the DNA sequence and taxonomy databases? I can see from the GreenGenes tutorial link above that the .qzv output for `evaluate-classifications` is a plot of F-measure for a given dataset; F-measure is the statistic used to evaluate a ML classifier. But I'm not sure whether a high F-measure in this visualization means that the in-house classifier is good, or the databases are good, or both.
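For context, the F-measure I am referring to is the harmonic mean of precision and recall; a toy computation (the precision/recall values here are made up for illustration):

```python
# F-measure (F1): harmonic mean of precision and recall, the statistic
# plotted in the evaluate-classifications .qzv output.

def f_measure(precision, recall):
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.9, 0.8), 3))  # 0.847
```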
Thank you for any information or guidance!