How to set up a high-recall classifier: changing k-mer length


I have a 16S paired-end dataset on which I have just run the assign-taxonomy step. Many of the resulting consensus taxonomies did not reach the genus level, which was my goal. My study explores bacterial populations in environmental samples, so I care more about deeper taxonomy than about avoiding false positives. After reading up on taxonomy assignment, including the paper “Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin”, I saw that for situations like this you can run a high-recall assign-taxonomy step to get deeper taxonomy, with the understanding that there is a greater chance of false positives, by changing the confidence from 0.7 to 0.5 and using a k-mer length between 12 and 32.

I know that you change the confidence using the --p-confidence flag with classify-sklearn, but I was wondering where you adjust the k-mer length and/or how to check which k-mer length you have used?

Any help would be much appreciated!

Hi @jjankowiak!

kmer length is set during the fit-classifier-naive-bayes step — so you will need to train your own feature classifier. The parameter you are looking for is --p-feat-ext--ngram-range.

Check the provenance on your file (with view.qiime2.org or qiime tools view): the kmer length will be listed among the parameters to the fit-classifier-naive-bayes method used to train the classifier.

If you are using a pre-trained classifier, I believe the kmer length is 7 or 8 (the default is currently 7, but old classifiers used a kmer length of 8).

I hope that helps!


Hi Nicholas_Bokulich,

Thank you for the quick response, this is exactly what I was looking for! Just to make sure I use this flag correctly: if I wanted to change the k-mer length to 12-32 as suggested in the paper, I would just input --p-feat-ext--ngram-range 12, 32 into my fit-classifier code (is that the correct formatting of the numbers)?

And just to double check: to run the high-recall version, all I have to do is retrain my classifier with this adjusted k-mer length and then run the assign-taxonomy step with classify-sklearn and the adjusted confidence of 0.5?

I believe the format would be: [12,32]
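To make the formatting concrete, here is a minimal Python sketch that assembles the two commands discussed above as argument lists. The flags are the ones named in this thread plus the standard input/output options of fit-classifier-naive-bayes and classify-sklearn; every file name is a hypothetical placeholder, and the helper names are made up for illustration.

```python
import shlex


def ngram_range_arg(k_min, k_max):
    """Format a k-mer range the way --p-feat-ext--ngram-range expects it."""
    return f"[{k_min},{k_max}]"


def fit_classifier_cmd(ref_reads, ref_taxonomy, k_min, k_max, out_classifier):
    """Argument list for training a naive Bayes classifier with a custom k-mer range."""
    return [
        "qiime", "feature-classifier", "fit-classifier-naive-bayes",
        "--i-reference-reads", ref_reads,
        "--i-reference-taxonomy", ref_taxonomy,
        "--p-feat-ext--ngram-range", ngram_range_arg(k_min, k_max),
        "--o-classifier", out_classifier,
        "--verbose",
    ]


def classify_cmd(classifier, reads, confidence, out_taxonomy):
    """Argument list for classifying reads with an adjusted confidence threshold."""
    return [
        "qiime", "feature-classifier", "classify-sklearn",
        "--i-classifier", classifier,
        "--i-reads", reads,
        "--p-confidence", str(confidence),
        "--o-classification", out_taxonomy,
    ]


# Hypothetical file names, for illustration only.
fit_cmd = fit_classifier_cmd(
    "ref-seqs.qza", "ref-taxonomy.qza", 12, 32, "high_recall_classifier.qza"
)
classify = classify_cmd(
    "high_recall_classifier.qza", "rep-seqs.qza", 0.5, "taxonomy_high_recall.qza"
)

# shlex.join quotes the bracketed range so a shell will not treat it as a glob.
print(shlex.join(fit_cmd))
print(shlex.join(classify))
```

The lists can be passed to subprocess.run as-is, or the printed lines can be pasted into a terminal.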


Let us know if you have any more questions.

Great, thank you for all the help!


Hi Nicholas_Bokulich,

I had a follow-up question about creating a high-recall classifier. As discussed above, I am refitting my classifier with the --p-feat-ext--ngram-range flag to change the k-mer length. While fitting my old classifier took only several hours, this new classifier has been running for over a day now. I unfortunately forgot to use --verbose, so I was wondering if this extra run time is to be expected and, if so, whether there is a guesstimate for how many times longer this would run than with a k-mer length of 7?

Here is the script I ran:
qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads /home/qiime2/Desktop/SILVA_132_16S_V4_505806/ref-seqs.qza --i-reference-taxonomy /home/qiime2/Desktop/SILVA_132_16S_V4_505806/ref-taxonomy.qza --p-feat-ext--ngram-range [12,32] --o-classifier /home/qiime2/Desktop/SILVA_132_16S_V4_505806/high_recall_classifier.qza

Sorry! We did not benchmark the effect of this parameter on time, so I do not know off-hand.

However, note that using an ngram range (e.g., [12,32]) is going to produce kmers of all lengths in that range, so it will take at least 21X as long as using [7,7] (since that only produces 7mers, not 12mers and 13mers and 14mers etc.). This is a very, very rough underestimate: the number of unique 32mers will probably be much greater than the number of, e.g., 12mers or 7mers, and it is impossible to predict how many there would be at a specific size, since it all depends on the sequence heterogeneity in your reference sequences (so I believe it is not simply 4^k).
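As a rough illustration of that reasoning, the sketch below counts the unique kmers of each length on two made-up toy sequences. It is not the plugin's actual feature extraction, only a way to see that a [12,32] range covers 21 separate lengths while [7,7] covers one; the function name and sequences are invented for the example.

```python
def unique_kmers_by_length(sequences, k_min, k_max):
    """Map each k in [k_min, k_max] to the number of distinct kmers of that length."""
    counts = {}
    for k in range(k_min, k_max + 1):
        kmers = set()
        for seq in sequences:
            # Slide a window of width k along the sequence.
            for start in range(len(seq) - k + 1):
                kmers.add(seq[start:start + k])
        counts[k] = len(kmers)
    return counts


# Made-up toy reference sequences, for illustration only.
toy_refs = [
    "ACGTACGTTAGCCGATAGCTAGGCTTACGATCGATCGGATCCGATTAGCAT",
    "TTGACCGATAGCTAGGCTAACGATCGTTCGGATCCGAATAGCATGGCTAAC",
]

single = unique_kmers_by_length(toy_refs, 7, 7)
ranged = unique_kmers_by_length(toy_refs, 12, 32)

print(len(single), "length(s) for [7,7]")
print(len(ranged), "length(s) for [12,32]")
for k, n in ranged.items():
    print(k, n)
```

On real reference sequences the per-length counts depend on how heterogeneous the references are, which is why the 21X figure is only a lower bound.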

You could check whether one of the other high-recall classifiers uses a tighter range, which would take less time to generate…

Sorry it’s taking so long!

Thank you for the great explanation that clarifies a lot. I was basing the k mer range on this portion of the paper:

For 16S rRNA gene sequences, naive Bayes bespoke classifiers with k-mer lengths between 12 and 32 and confidence = 0.5 yield maximal recall scores, but RDP (confidence = 0.5) and naive Bayes (uniform class weights, confidence = 0.5, k-mer length = 11, 12, or 18) also perform well (Table 2).

So I thought I should use the full range from 12 to 32. However, looking at the recall section in Table 2 for the naive Bayes classifier, I now see that [11,11] seems to have the highest F, P and R scores across the board (compared to the 12 and 18 k-mer lengths) with a confidence of 0.5. I think I will try just the [11,11] range for now, since I have a quickly approaching deadline.

Thank you again for the fast and detailed explanations, I now understand what these parameters mean and how to adjust them for different goals!


I had one further follow-up question regarding the high-recall taxonomies. After comparing my original taxonomy file (k-mer length 7, confidence 0.7) to the taxonomy file created using my high-recall classifier (k-mer length 11, confidence 0.5), I noticed that some ASVs were assigned deeper taxonomies in the high-recall taxonomy file but others had deeper taxonomies in the original taxonomy file (typically a difference of one level). If my goal is to explore bacterial populations from systems that may contain many unknowns, is it alright to use the deepest taxonomy identified for each ASV (essentially a mixed taxonomy file) and just note that the taxonomies used are from different methods? If so, is there any way to merge the two taxonomy files, keeping the deeper taxonomy?

I would discourage merging the deepest taxonomies, if only because it gets very messy from a record-keeping and reporting standpoint.

If you are studying an environment with many unknowns, wouldn’t it be better to know that they are unknowns?

The “high-recall” classifiers were chosen as those that maximized recall, but still had high precision. This generally is linked to deeper taxonomic classifications, but will not necessarily behave the same for all ASVs, so it makes sense that you see deeper classification for some but shallower for others.

I am not encouraging this, since precision will suffer and you may get misleading results, but the best way to deepen taxonomy classifications is to further reduce the confidence parameter. And if you really just want the “closest match”, you can use classify-consensus-blast with maxaccepts set to 1.
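To show why lowering the confidence deepens classifications (and costs precision), here is a simplified toy sketch. It assumes the classifier reports the deepest rank whose summed support meets the threshold; the function name and the probabilities are made up, and this is not the plugin's actual code.

```python
def truncate_by_confidence(class_probs, confidence):
    """Return the deepest lineage of the top class whose summed support >= confidence."""
    best = max(class_probs, key=class_probs.get)
    ranks = best.split(";")
    # Walk from the deepest rank upwards, pooling support from sibling classes.
    for depth in range(len(ranks), 0, -1):
        prefix = ranks[:depth]
        support = sum(
            prob
            for label, prob in class_probs.items()
            if label.split(";")[:depth] == prefix
        )
        if support >= confidence:
            return ";".join(prefix)
    return "Unassigned"


# Made-up posterior probabilities for a single ASV, for illustration only.
toy_probs = {
    "Bacteria;Firmicutes;Bacillus": 0.55,
    "Bacteria;Firmicutes;Paenibacillus": 0.25,
    "Bacteria;Proteobacteria;Escherichia": 0.20,
}

print(truncate_by_confidence(toy_probs, 0.7))  # stops at Bacteria;Firmicutes
print(truncate_by_confidence(toy_probs, 0.5))  # reaches Bacteria;Firmicutes;Bacillus
```

In this toy case the genus call at 0.5 rests on only 55% support, which is the precision trade-off described above.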

I hope that helps!

Thank you for the help. I wanted to use the high-recall classifier instead of setting maxaccepts to 1 to keep some precision; I was just hoping for deeper classifications for some ASVs that only went to the phylum level, for example. I guess I will compare the taxonomies across files and use the file that overall has deeper classifications (which I would assume is the high-recall one).

All the help is much appreciated

Oh, phylum is not too good. Usually we should see deeper classification unless the sequences are really short! Do you want to share some examples? A barplots QZV? We can make sure there isn’t another issue. You may also want to NCBI BLAST those individual sequences to see what the top GenBank hit is… these could be non-target DNA, hence the poor classification.

I would also expect high recall to have deeper classifications, but not necessarily by much. It makes sense to check out both and see which fits your experimental goals.
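One way to check both without merging them is to tally, per ASV, which taxonomy goes deeper. The sketch below assumes each taxonomy has already been loaded as a dict of ASV ID to a semicolon-separated lineage (for example from an exported taxonomy.tsv); the SILVA-style labels, ASV IDs and function names are made up for illustration.

```python
from collections import Counter


def taxonomy_depth(taxon):
    """Count the named ranks in a lineage such as 'D_0__Bacteria;D_1__Firmicutes'."""
    depth = 0
    for rank in taxon.split(";"):
        name = rank.strip()
        # Drop a rank prefix such as 'D_5__' or 'g__' if one is present.
        if "__" in name:
            name = name.split("__", 1)[1]
        if not name or name == "Unassigned":
            break
        depth += 1
    return depth


def compare_depths(original, high_recall):
    """Tally which taxonomy is deeper for every ASV present in both tables."""
    tally = Counter()
    for asv_id in original.keys() & high_recall.keys():
        d_orig = taxonomy_depth(original[asv_id])
        d_high = taxonomy_depth(high_recall[asv_id])
        if d_high > d_orig:
            tally["deeper_in_high_recall"] += 1
        elif d_orig > d_high:
            tally["deeper_in_original"] += 1
        else:
            tally["same_depth"] += 1
    return tally


# Made-up assignments, for illustration only.
original = {
    "asv1": "D_0__Bacteria;D_1__Firmicutes",
    "asv2": "D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria",
    "asv3": "D_0__Bacteria",
}
high_recall = {
    "asv1": "D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli",
    "asv2": "D_0__Bacteria;D_1__Proteobacteria",
    "asv3": "D_0__Bacteria",
}

summary = compare_depths(original, high_recall)
print(dict(summary))
```

The tally only summarizes the two files side by side; each taxonomy stays intact, so the record-keeping concern about mixed taxonomies does not arise.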
