Understanding naive bayes+sklearn classifier, top-hit identity distribution

Hi all - currently analyzing eukaryote short amplicon metabarcoding datasets with classify-consensus-vsearch and -blast, as well as sklearn with naive bayes training sets in qiime2-2022.8. A few questions:

  1. Looking for help understanding the mechanics of the naive bayes-sklearn classification process and any additional documentation that would help explain the default naive bayes training parameters.
  2. Would also love to make visualizations of my reference dbs as interpreted in distance/similarity by qiime's naive bayes classifier - not sure if there is a qiime function that will accomplish this. Something like a PCoA plot but with the clusters being generated with ref sequences (fasta) rather than samples, and then I would like to label the points with taxonomy. Or some other way that would help me better understand how well the nb classifier can distinguish between different taxa groups with my ref dbs!
  3. Is there a qiime feature for generating top-hit identity distributions for ASV classifications with the consensus classifiers? Like this example from usearch!
    Thanks for any suggestions!

Good evening,

For naive-bayes classification:

Original paper: Wang et. al. 2007 with over 17k citations !! https://journals.asm.org/doi/full/10.1128/AEM.00062-07
The Qiime2 version: Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin | Microbiome | Full Text
The default settings are based on findings from that paper. Post you questions about specific ones!

That is such a good idea, but I don't remember seeing it done before!

You can do PCoA of features instead of samples, and use a distance matrix of edit-distances instead of UniFrac distances or whatever. The plot shows the 'density' of database coverage and how it varies based on taxonomy.

This would totally work and I would love to see it!

Not yet. Have you tried writing his suggested script?

1 Like

Thanks for your thoughtful reply, Colin!

I've read these papers, but I guess I'm still confused as to what 'features' the nb sklearn classifier uses to calculate likelihood probabilities (k-mer length, for example? maybe this is a parameter I missed). I think I understand that prior probabilities come from counts of ref seqs by taxonomic rank, but please correct me if I'm wrong there too. I'm trying to figure out if the --p-confidence threshold for sklearn nb corresponds to a posterior probability, and if that's the case, then I would like to know how it's calculated.

I think if I knew which features to use with a distance matrix and how to extract them, then I could do what you suggested! It would be great if you could clarify a bit more and if possible direct me to q2 documentation with applicable code examples?

I don't have usearch and my mac OS is unfortunately not compatible with the free 32-bit version, so I haven't tried it. Was hoping there might be a way with qiime to generate the same information - oh well!

1 Like

Hi @marymcelroy ,

kmer frequencies per taxonomic group

close! that would be the default, but the frequency of taxonomic ranks in the reference databases would be highly skewed and artificial (i.e., toward model species, pathogens, easily cultivated organisms, etc). So by default q2-feature-classifier (and RDP classifier and every other naive bayes classifier out there used for sequence classification) assumes a uniform prior, i.e., the prior probability term is replaced with a constant so effectively removed from the equation. HOWEVER, in QIIME 2 it is possible to adjust the prior based on expected species distributions, see here:


You can see the scikit-learn documentation for more details on what is happening under the hood. Start here:

But for a more user-friendly explanation of Naive Bayes in the context of taxonomic classification check out this section of applied bioinformatics from @gregcaporaso (see the section on Naive Bayes classifiers — but note that this is for educational purposes and does not exactly match what scikit-learn is doing):

serendipitously, it looks like the IAB chapter that I linked above would also show you how to do this (outside of QIIME 2). A function inside of QIIME 2 to make such pairwise comparisons between sequences would be neat :wink: (not sure who I am winking at, maybe @gregcaporaso :joy: )

It looks like that is just visualizing the distribution of %id scores. Another option could be to run qiime feature-classifier vsearch-global --top-hits-only (adjusting some other parameters as appropriate) to generate a report of top hits to all queries. This could be exported as a TSV and plotted in R, python, etc. But there is not an equivalent function already existing in QIIME 2. Would make a great contribution to either q2-feature-classifier or q2-quality-control :wink:


OK, I did the thing:



Wow, thanks Colin and @Nicholas_Bokulich! I really appreciate the responses, especially the IAB and code examples from you both. I'll take a look at all this.

One last question for now. I think retraining my classifiers with class weights would improve species level assignments as recommended in the Kaehler paper you shared. I know the ref seqs in my dbs are high-quality and identified to species, but taxonomic representation relative to what I would expect to find in my environmental samples is low (20-40% of expected species represented in habitat-specific ref dbs with definite bias for some taxa groups, like way more verts than inverts or algae). So there is high potential for novel query seqs that have not been used in the training process as well as rare species. I'm interested primarily in marine macro-eukaryotes, and I've used multiple short markers 114bp-313bp.

Any suggestions for how to improve classification with nb classifier in this context? It might be difficult to determine appropriate bespoke weights, but it sounds like using avg class weights from a global ref dataset (rather than regional/habitat) could help or at least be better than uniform weights? Not sure if I can use q2-clawback for this (I already have global+regional ref dbs and taxonomy files set up), so any implementation tips would also be appreciated. Thanks again!

I am less familiar with k-mer nb classifiers and Basian weighted priors than top hit LCA classifiers, so I'll stick to what I know.

If you have not found it already, check out RESCRIPt and its detailed tutorial. Instead of adjusting Basian priors like q2-clawback, RESCRIPt works to better target/curate the region and taxonomy labels in your database. And it includes benchmarks of how well it works:

This method provides information about the number, entropy, and other characteristics of taxonomic labels at each rank in an input taxonomy. It is much less time-consuming than the classification methods above, and provides valuable information about the amount of “information” in a taxonomy artifact.

With a better database, all methods will work better. I would start there!


Hi @marymcelroy ,

Yeah as you are using non-bacterial-16S marker genes (for which much data exists) I am sort of doubting whether there is enough representative data to build appropriate taxonomic weights. You could try making average weights with 18S data in general (for which enough data exists), which could be really nice to see. But I worry about this part:

If you are expecting such a high number of unrepresented/novel species, you might not want to use a bespoke classifier, as the whole idea is to upweight expected species based on prior info. When you don't know what to expect, or when you expect the unexpected, this might not be a good thing! It could be worth a try but it would be a challenge to set up.

Taxonomic weights are an alternative to the regional database approach, as they allow you to instead upweight those expected in a given region while still leaving open the possiblity that "other things are out there". The regional database approach is prone to misclassification/overclassification issues, as you restrict the search space while assuming that you have total representation, which clearly only works when your database really does have good coverage. So I worry about whether this would work when you expect such a high number of uknown species.

Thanks @colinbrislawn ! Just to add, it is not an either/or necessarily. There is added value in using these together (but only when appropriate prior data exists for q2-clawback, of course). RESCRIPt is for sure the more general purpose tool, useful for working with any type of marker gene data... but clawback is more specific.


Okay, this is great info, Nicholas! Thank you so much for these insights. Truly, I am learning a lot from this exchange. Yes, I was concerned with the low level of species representation in my regional ref dbs, which I created by filtering the global ref dbs based on a regional species list. Maybe it would be better to use the global dbs and add class weights to regional species there?

If ref db representation is better at higher taxonomic ranks, could I be more confident in those classifications using either global/regional dbs, esp. if I'm setting high --p-confidence values (like 0.90 or 0.95)? How does the nb classifier decide if a genus classification is better than a species annotation, for ex? Does it calculate post probabilities for all ranks in the ref db and not just for species? I read this post, and it seems maybe to answer my question. That is, if the species level annotation doesn't meet the --p-confidence threshold, then it will sum the probabilities for all the species from the genus in the training set and so on through higher ranks until the threshold is exceeded. Do I have that right?

Does the risk of over/misclassification of novel queries come from situations where my training sets give high post probabilities for certain species that would potentially exceed my confidence threshold and wrongly assign novel seqs to species rather than a higher rank? I assume this is more of a risk with my regional ref dbs.

1 Like

Thanks, Colin. I opted to use Gert-Jan Jeunen's CRABS reference database curation program, but it looks like a similar process to what is implemented with RESCRIPt! The ref dbs I'm using with the q2 classifiers are outputs from this curation workflow.

1 Like

That is in theory the better option. That is what I recommend for bacterial 16S, since lots of prior data exists and the databases have reasonably good coverage for most habitats. In your case, since you are using different marker genes and expect a lot of unknown, I am just not sure, since you might not have enough prior data :man_shrugging:

For global, yes. For regional, not necessarily. Unless if a regional database contains other near neighbors (e.g., various local species belonging to the same genus), then anything in that clade will classify to species level as there are no other near hits. This is a good thing, of course, if the local biodiversity is very well characterized and contamination/PCR/sequencing errors are non-existent. But otherwise it is a bad thing, as unknown species/contaminants/errors will wind up misclassified to the nearest group in the database. So if you expect a large number of unknowns it might be worth using the global database to avoid overclassification — depends on whether you prioritize deeper classification of the "knowns" or preventing overclassification of the unknowns.


Yes! Very well put.