Detailed explanation of confidence calculation for taxonomic assignments

Hi

I want to clarify the details of how confidence is calculated.

I read this one (very good), but I still have questions.

I am unclear on this part (quoted from the link above):

The confidence of a classification is calculated by bootstrapping (subsampling) the bag of 8-mers 100 times, and seeing how many times the subsample comes up with the same classification as the full read

Is it true that, for a read of length 300, I would draw 100 random 8-mers from it, classify the whole read with Naive Bayes, and then classify the random 8-mers with Naive Bayes 100 times? The full-length read gets the classification "species A", while the 8-mers get species A, species B, and species C. The confidence would then be the ratio (species A) / (species A + species B + species C). Am I right? If so, the algorithm looks very strange.

I also want to clarify how confidence is aggregated to more general taxonomic levels. Returning to the example above:

Let

species A be 20%
species B be 50%
species C be 30%

Species A and species B share a genus (L6), while species C shares only the kingdom (L1) with species A and species B.

Then the confidence at each level would be:

L7 - 20%
L6 - 70%
L5 - 70%
L4 - 70%
L3 - 70%
L2 - 70%
L1 - 100%

Am I right?
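
For reference, the roll-up arithmetic in this example can be written out as a short Python sketch. It only reproduces the summation described above: the support for a rank is the sum over all species that share the lineage up to that rank. The lineage labels (`k1`, `p1`, and so on) and the function name `support_along` are made up for illustration, and nothing here is confirmed to be how q2-feature-classifier aggregates confidence.

```python
def support_along(lineage, species_support):
    """Sum the support of every species sharing `lineage` up to each rank.

    `lineage` is a tuple of rank labels from L1 (kingdom) to L7 (species).
    `species_support` maps full lineages to a percentage.
    Returns one total per rank, ordered L1..L7.
    """
    totals = []
    for level in range(1, len(lineage) + 1):
        prefix = lineage[:level]
        totals.append(
            sum(pct for lin, pct in species_support.items() if lin[:level] == prefix)
        )
    return totals


# Hypothetical lineages: A and B share everything down to genus (L6),
# C shares only the kingdom (L1) with them.
species_a = ("k1", "p1", "c1", "o1", "f1", "g1", "sA")
species_b = ("k1", "p1", "c1", "o1", "f1", "g1", "sB")
species_c = ("k1", "p2", "c2", "o2", "f2", "g2", "sC")

species_support = {species_a: 20, species_b: 50, species_c: 30}

# Ordered L1..L7 for species A's lineage.
print(support_along(species_a, species_support))  # [100, 70, 70, 70, 70, 70, 20]
```

The output lists L1 first, so it matches the table above read from bottom to top.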

Hi @biojack ,

For a detailed explanation of how Naive Bayes classifiers work, you can start here:

The confidence scores reported by q2-feature-classifier are the class probabilities reported by whatever algorithm you use (in the default case, Naive Bayes, but other classifiers can also be used). We dropped the RDP-style bootstrapping approach fairly early on (that is a 2016 forum post that you are referencing, and already in that post you can see that Ben mentioned dropping this feature).
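
To make "class probabilities" concrete, here is a minimal pure-Python sketch of a multinomial Naive Bayes over 8-mers. It is not the q2-feature-classifier code, which builds on scikit-learn. The function names, the uniform class prior, the Laplace smoothing value `alpha`, and the toy reference sequences are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def kmers(seq, k=8):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(references, k=8):
    """Count k-mers per class; `references` maps a label to one sequence."""
    counts = {label: Counter(kmers(seq, k)) for label, seq in references.items()}
    vocab = set().union(*counts.values())
    return counts, vocab


def class_probabilities(read, counts, vocab, k=8, alpha=1.0):
    """Posterior probability of each class for `read`, assuming a uniform prior."""
    log_scores = {}
    for label, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        log_scores[label] = sum(
            math.log((c[km] + alpha) / total)
            for km in kmers(read, k)
            if km in vocab  # k-mers never seen in training are skipped
        )
    # Normalize in log space to avoid underflow.
    top = max(log_scores.values())
    unnormalized = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {label: v / z for label, v in unnormalized.items()}


# Toy reference sequences (made up).
references = {
    "species A": "ACGTTGCAAGCTTAGGCTAACGGATCCGTA",
    "species B": "GGGCCCATATATCGCGCGTTTAAACCCGGG",
}
counts, vocab = train(references)
probs = class_probabilities(references["species A"][5:25], counts, vocab)
print(probs)
```

In this picture the confidence is just the probability assigned to the winning class, with no resampling step involved.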

If you are interested in the bootstrapping approach, I recommend reading the various articles about the RDP classifier, which introduced this approach for taxonomic classification. I believe I shared the original 2007 article with you in a different topic, so you could use that article as a starting point.
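
As a rough illustration of the bootstrapping idea, the sketch below subsamples the bag of 8-mers 100 times and counts how often the subsample gets the same label as the full read. The classifier `toy_classify` is a hypothetical stand-in (it just picks the reference sharing the most k-mers), and the subsample size of one eighth of the bag is an assumption modeled on how the RDP approach is usually described. Treat it as a sketch, not as the RDP or QIIME 2 implementation.

```python
import random


def kmers(seq, k=8):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def toy_classify(bag, references):
    """Hypothetical classifier: the label whose k-mer set overlaps the bag most."""
    scores = {
        label: sum(1 for km in bag if km in ref_kmers)
        for label, ref_kmers in references.items()
    }
    # Ties are broken alphabetically so the result is deterministic.
    return max(sorted(scores), key=scores.get)


def bootstrap_confidence(seq, references, n_boot=100, k=8, seed=0):
    """Fraction of bootstrap subsamples that agree with the full-read label."""
    rng = random.Random(seed)
    bag = kmers(seq, k)
    full_label = toy_classify(bag, references)
    subsample_size = max(1, len(bag) // 8)  # assumed: one eighth of the bag
    agree = 0
    for _ in range(n_boot):
        # Sample k-mers from the read's bag with replacement.
        sample = [rng.choice(bag) for _ in range(subsample_size)]
        if toy_classify(sample, references) == full_label:
            agree += 1
    return full_label, agree / n_boot


# Toy references (made up), stored as sets of 8-mers.
seq_a = "ACGTTGCAAGCTTAGGCTAACGGATCCGTA"
seq_b = "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"
references = {"species A": set(kmers(seq_a)), "species B": set(kmers(seq_b))}

label, confidence = bootstrap_confidence(seq_a, references)
print(label, confidence)
```

Note that each replicate classifies a whole subsampled bag of 8-mers, not a single 8-mer, and the confidence is the share of replicates that agree with the full-read call.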

Good luck!

Yes, that's why my question is about confidence, not about Naive Bayes :wink: I only mentioned Naive Bayes as the default example.

So, if I understand you correctly, the bootstrapping approach was changed, so today the confidence is not the percentage of 8-mer subsamples classified the same as the full read, and QIIME 2 uses some other bootstrapping approach? Could you describe it or give a reference for the current approach? (Of course, if it is a trade secret, hands off.)

Also, could you clarify whether my example with aggregated percentages by level is correct? If not, it would be great to see the correct percentages.