How is the "Confidence" calculated with taxa assignments?

BenKaehler · December 14, 2016, 2:57am

Thanks @John_Chase, the Confidence column isn't documented yet, sorry. It is not yet a stable feature.

For the current release, I have tried to mimic the way the RDP Classifier calculates confidence values. Confidence is only calculated and used if the confidence parameter is set to a non-negative value when calling the classify method.

The basic classification method is to decompose the read into a bag of overlapping 8-mers, then feed that bag as input to the machine learning classifier (Naive Bayes by default). The confidence of a classification is calculated by bootstrapping (subsampling) the bag of 8-mers 100 times and counting how many of the subsamples come up with the same classification as the full read. If the confidence parameter is between zero and one, the classifier starts at the top taxonomic level and works its way down the levels until the calculated confidence falls below the value of the input parameter. At that point it truncates the classification to the last good level and reports the calculated confidence in the Confidence column.
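
To make the procedure above concrete, here is a rough Python sketch. It is not the plugin's actual code. The function names (`kmers`, `bootstrap_confidence`, `truncate`, `toy_classify`) are hypothetical, the toy classifier stands in for the real Naive Bayes model, and two details are assumptions borrowed from how I understand RDP to work: each subsample draws one eighth of the 8-mers, with replacement. Reporting the confidence of the last retained level is also an assumption.

```python
import random


def kmers(read, k=8):
    """Return the bag (list) of overlapping k-mers in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def bootstrap_confidence(bag, classify, n_boot=100, seed=0):
    """Classify the full bag, then report, per taxonomic level, the
    fraction of bootstrap subsamples that agree with that call."""
    if not bag:
        raise ValueError("read is shorter than k, so the bag is empty")
    rng = random.Random(seed)
    full = classify(bag)  # list of taxa, top level first
    agree = [0] * len(full)
    # Assumption: RDP-style subsample of 1/8 of the k-mers, with replacement.
    size = max(1, len(bag) // 8)
    for _ in range(n_boot):
        sample = [rng.choice(bag) for _ in range(size)]
        call = classify(sample)
        for level in range(len(full)):
            # Agreement at a level means the whole lineage down to it matches.
            if call[:level + 1] == full[:level + 1]:
                agree[level] += 1
    return full, [a / n_boot for a in agree]


def truncate(taxonomy, confidences, threshold):
    """Walk down from the top level and stop at the first level whose
    confidence falls below the threshold. Returns the retained lineage and
    the confidence of the last retained level (None if nothing is kept)."""
    kept, reported = [], None
    for taxon, conf in zip(taxonomy, confidences):
        if conf < threshold:
            break
        kept.append(taxon)
        reported = conf
    return kept, reported


def toy_classify(bag):
    """Hypothetical stand-in classifier: a majority vote on the first base."""
    starts_with_a = sum(1 for kmer in bag if kmer.startswith("A"))
    if 2 * starts_with_a >= len(bag):
        return ["Bacteria", "Firmicutes"]
    return ["Bacteria", "Proteobacteria"]


taxonomy, confidences = bootstrap_confidence(kmers("A" * 20), toy_classify)
lineage, confidence = truncate(taxonomy, confidences, threshold=0.7)
```

With a real classifier the per-level confidences would normally decrease as you move down the levels, so a threshold such as 0.7 might keep the lineage down to family but drop genus and species.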

This feature is unstable because in a future release we will allow machine learning classifiers other than Naive Bayes to be used, and the bootstrap method does not generalise well to other classification methods. We have an alternative strategy that will be released at that time.

I hope that helps. Please let me know if you have any further questions.