Hi, I’m looking for some guidance on how to interpret the ‘Confidence’ column for taxonomic assignments obtained using classify-sklearn (if it matters, I have been using the classifier pre-trained on trimmed 99% OTUs from Greengenes).
I see that there was another post on this topic in 2017. Since the answer then indicated that the notion of confidence was under development, and since the default value for the --p-confidence parameter in the current version of classify-sklearn differs from what that answer describes, I was hoping to get an updated response.
Can we interpret the Confidence value as % sequence identity to the reference sequence for the taxon an ASV is assigned to (i.e. 1 is 100% sequence identity to the reference sequence, 0.95 is 95% sequence identity), or is there something else going on?
There have been no changes to how confidence is calculated since then. We were testing out different ways to calculate confidence but wound up sticking with the original.
We did wind up adjusting the default value of --p-confidence, based on benchmarking results, but not how confidence is calculated.
No, this is totally unrelated to % identity. Confidence values here are the raw probability estimates output by the naive Bayes classifier, i.e., the predicted probability that the assigned taxon is correct rather than some other taxon. Naive Bayes classifiers are good at classifying but poor at estimating probabilities, so the “confidence” scores should not be taken too seriously: they are just a rough estimate of how confident the classifier feels about its own prediction!
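To make the difference from % identity concrete, here is a minimal toy sketch of a multinomial naive Bayes classifier over k-mer counts, written in plain Python. It is not the q2-feature-classifier code: the function names (`train`, `posterior`, `classify`), the two made-up reference sequences, the k-mer size, and the 0.7 cutoff are all assumptions chosen for illustration. The point is only that the reported number is a normalized posterior probability across taxa, so a query with a mismatch against its closest reference can still get a confidence very close to 1.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(refs, k=3, alpha=1.0):
    """Fit a toy multinomial naive Bayes model.

    refs maps a taxon label to a list of reference sequences (hypothetical data).
    alpha is the Laplace smoothing constant.
    """
    counts = {taxon: Counter(km for s in seqs for km in kmers(s, k))
              for taxon, seqs in refs.items()}
    vocab = set().union(*counts.values())
    n_seqs = sum(len(seqs) for seqs in refs.values())
    model = {}
    for taxon, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        model[taxon] = {
            "log_prior": math.log(len(refs[taxon]) / n_seqs),
            "log_lik": {km: math.log((c[km] + alpha) / total) for km in vocab},
        }
    return model


def posterior(model, seq, k=3):
    """Return P(taxon | sequence) for every taxon; the values sum to 1."""
    scores = {}
    for taxon, params in model.items():
        score = params["log_prior"]
        for km in kmers(seq, k):
            # k-mers never seen in training are skipped for every taxon
            if km in params["log_lik"]:
                score += params["log_lik"][km]
        scores[taxon] = score
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {t: math.exp(s - top) for t, s in scores.items()}
    norm = sum(weights.values())
    return {t: w / norm for t, w in weights.items()}


def classify(model, seq, threshold=0.7, k=3):
    """Return (label, confidence); label is 'Unassigned' below the threshold.

    threshold=0.7 is an illustrative value, not taken from the thread.
    """
    probs = posterior(model, seq, k)
    best = max(probs, key=probs.get)
    confidence = probs[best]
    return (best if confidence >= threshold else "Unassigned"), confidence


# Made-up reference sequences, one per taxon.
refs = {
    "TaxonA": ["ACGTACGTACGTACGT"],
    "TaxonB": ["TTGGCCAATTGGCCAA"],
}
model = train(refs)

# One mismatch relative to the TaxonA reference (15 of 16 positions identical).
query = "ACGTACGAACGTACGT"
label, confidence = classify(model, query)
print(label, confidence)
```

In this toy setup the query is about 94% identical to the TaxonA reference, yet the confidence comes out far above 0.94, because the score only reflects how much more probable TaxonA is than TaxonB under the model. That is also why these values are rough: naive Bayes treats k-mers as independent, which tends to push the probabilities toward 0 or 1.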