I want to clarify some details about how the confidence is calculated.
I read this one (very good), but I still have questions.
I did not understand this part (quoted from the link above):
The confidence of a classification is calculated by bootstrapping (subsampling) the bag of 8-mers 100 times, and seeing how many times the subsample comes up with the same classification as the full read
Is it true that if I have a read of length 300, I generate 100 random 8-mers from it, classify the whole read with Naive Bayes, and then classify the random 8-mers with Naive Bayes 100 times? For the whole read I get the classification "species A". For the 8-mers I get species A, species B, and species C. So the confidence is the ratio (species A) / (species A + species B + species C). Am I right? If so, the algorithm looks very strange.
I also want to clarify how confidence is aggregated to more general taxonomic levels. Returning to the example above:
Let
species A be 20%
species B be 50%
species C be 30%
Species A and species B share a genus (L6), while species C shares only the kingdom (L1) with species A and species B.
For a detailed explanation of how Naive Bayes classifiers work, you can start here:
The confidence scores reported by q2-feature-classifier are the class probabilities reported by whatever algorithm you use (in the default case, Naive Bayes, but other classifiers can also be used). We dropped the RDP-style bootstrapping approach fairly early on (that is a 2016 forum post that you are referencing, and already in that post you can see that Ben mentioned dropping this feature).
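To make "class probabilities" concrete, here is a minimal sketch of a toy multinomial Naive Bayes over 8-mers, written in plain Python. It is not the q2-feature-classifier code: the function names, the uniform class priors, the Laplace smoothing value, and the two reference sequences are all assumptions made up for illustration.

```python
import math
from collections import Counter


def kmers(seq, k=8):
    """Return the bag (list) of overlapping k-mers in a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(references, k=8):
    """Count k-mers per class. `references` maps label -> list of sequences."""
    counts = {
        label: Counter(km for seq in seqs for km in kmers(seq, k))
        for label, seqs in references.items()
    }
    vocab = set().union(*counts.values())
    return counts, vocab


def class_probabilities(read, counts, vocab, k=8, alpha=1.0):
    """Posterior P(class | read), assuming uniform priors and Laplace smoothing."""
    log_post = {}
    for label, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        log_post[label] = sum(
            math.log((c[km] + alpha) / total)
            for km in kmers(read, k)
            if km in vocab  # k-mers never seen in training are ignored
        )
    # Normalise in log space (log-sum-exp) so the probabilities sum to 1.
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {label: math.exp(v - top) / norm for label, v in log_post.items()}


# Hypothetical reference sequences, invented for this example.
references = {
    "species A": ["ACGTTGCAAGCTTAGGCTAACGGATCCGTA"],
    "species B": ["GGGGCCCCTTTTAAAAGGGCCCTTTAAAGG"],
}
counts, vocab = train(references)
probs = class_probabilities("ACGTTGCAAGCTTAGGCTAACGGATCCGTA", counts, vocab)
best = max(probs, key=probs.get)
print(best, probs[best])  # the second value plays the role of "confidence"
```

In this sketch the read is classified once, and the confidence is simply the normalised probability of the winning class. No subsampling is involved.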
If you are interested in the bootstrapping approach, I recommend reading the various articles about the RDP classifier, which introduced this approach for taxonomic classification. I believe I shared the original 2007 article with you in a different topic, so you could use that article as a starting point.
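For readers who want a feel for the bootstrapping idea before reading those articles, below is a rough sketch of it. Several things are assumptions on my part: the subsample size (one eighth of the k-mers, drawn with replacement, which is how I read the RDP description), the helper names, and the deliberately naive `classify` function that just counts shared 8-mers. Any classifier that maps a bag of k-mers to a label could be plugged in instead.

```python
import random


def kmers(seq, k=8):
    """Return the bag (list) of overlapping k-mers in a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def make_toy_classifier(references, k=8):
    """Build a hypothetical classifier: the label sharing the most k-mers wins."""
    ref_kmers = {label: set(kmers(seq, k)) for label, seq in references.items()}

    def classify(bag):
        hits = {
            label: sum(km in ref for km in bag)
            for label, ref in ref_kmers.items()
        }
        # Iterating over sorted labels makes tie-breaking deterministic.
        return max(sorted(hits), key=hits.get)

    return classify


def bootstrap_confidence(read, classify, n_iter=100, k=8, seed=0):
    """Fraction of subsampled classifications that match the full-read label."""
    rng = random.Random(seed)
    bag = kmers(read, k)
    full_label = classify(bag)
    # Assumption: each trial draws len(bag) // k k-mers with replacement.
    size = max(1, len(bag) // k)
    agree = sum(
        classify(rng.choices(bag, k=size)) == full_label
        for _ in range(n_iter)
    )
    return full_label, agree / n_iter


# Hypothetical reference sequences, invented for this example.
references = {
    "species A": "ACGTTGCAAGCTTAGGCTAACGGATCCGTA",
    "species B": "GGGGCCCCTTTTAAAAGGGCCCTTTAAAGG",
}
classify = make_toy_classifier(references)
label, confidence = bootstrap_confidence("ACGTTGCAAGCTTAGGCTAACGGATCCGTA", classify)
print(label, confidence)
```

Note that the full read is classified once and each of the 100 trials classifies a whole subsample of k-mers, not a single 8-mer. The confidence is the share of trials that agree with the full-read label.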
Yes, that's why my question is about confidence, not about Naive Bayes. I mentioned Naive Bayes only as the default example.
So, if I understand you correctly, the bootstrapping approach was changed: today the confidence is not the percentage of 8-mer subsamples that are classified the same as the full read, and QIIME 2 uses some other bootstrapping approach? Could you describe it or give a reference for the current approach? (Of course, if it is a trade secret, hands off.)
Also, could you clarify whether my example with percentages aggregated by level is correct? If not, it would be great to see the correct percentages.
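For the aggregation example above (species A 20%, species B 50%, species C 30%), here is a small sketch of one possible rule: the confidence at a higher rank is the sum of the probabilities of all species below it. Nothing in this thread confirms that this is what q2-feature-classifier does, so treat the rule as an assumption. The lineage names (`k1`, `g1`, and so on) are hypothetical placeholders for ranks L1 to L7.

```python
from collections import defaultdict


def aggregate_by_rank(species_probs, lineages):
    """Sum species-level probabilities over every lineage prefix (rank)."""
    totals = defaultdict(float)
    for species, p in species_probs.items():
        lineage = lineages[species]
        for depth in range(1, len(lineage) + 1):
            totals[tuple(lineage[:depth])] += p
    return dict(totals)


# Probabilities from the example in this thread.
species_probs = {"species A": 0.20, "species B": 0.50, "species C": 0.30}

# Hypothetical lineages: A and B share everything down to the genus (L6),
# while C shares only the kingdom (L1) with them.
lineages = {
    "species A": ("k1", "p1", "c1", "o1", "f1", "g1", "species A"),
    "species B": ("k1", "p1", "c1", "o1", "f1", "g1", "species B"),
    "species C": ("k1", "p2", "c2", "o2", "f2", "g2", "species C"),
}

totals = aggregate_by_rank(species_probs, lineages)
genus_ab = totals[("k1", "p1", "c1", "o1", "f1", "g1")]
kingdom = totals[("k1",)]
print(f"genus shared by A and B: {genus_ab:.0%}, kingdom: {kingdom:.0%}")
```

Under this assumed rule the shared genus gets 20% + 50% = 70% and the kingdom gets 100%, while no single species exceeds 50%.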