How is the "Confidence" calculated with taxa assignments?

In the FeatureData[Taxonomy] object there is a column labeled “Confidence”. I may have simply missed this in the documentation, but I was wondering if someone could help me to understand how this is being calculated?

Thanks!

@John_Chase, I contacted the developer, Ben K, about this. He doesn’t yet have a forum account, but I expect he’ll get back to you pretty quickly.

Thanks @John_Chase, the Confidence column isn’t documented yet, sorry. It is not yet a stable feature.

For the current release, I have tried to mimic the way that the RDP Classifier calculates confidence values. Note that confidence is only calculated and used if the confidence parameter is set to a non-negative value when calling the classify method.

The basic classification method is to decompose the read into a bag of overlapping 8-mers, then feed that bag as input to the machine learning classifier (Naive Bayes by default). The confidence of a classification is calculated by bootstrapping (subsampling) the bag of 8-mers 100 times and counting how many of those subsamples receive the same classification as the full read. If the confidence parameter is between zero and one, the classifier starts at the top taxonomic level and works its way down the levels until the calculated confidence falls below the value of the input parameter. At that point it truncates the classification to the last good level and reports the calculated confidence in the Confidence column.
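
To make the procedure above concrete, here is a rough Python sketch. It is not the plugin's actual code. The function names, the `toy_classify` stand-in (a simple k-mer overlap count instead of a trained Naive Bayes model), the subsample size of one eighth of the bag, and measuring agreement separately at each taxonomic level are all my own assumptions for illustration.

```python
import random

K = 8               # k-mer length described in the post
N_BOOTSTRAPS = 100  # number of subsamples described in the post


def kmers(read, k=K):
    """Return the bag (list) of overlapping k-mers in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def toy_classify(bag, reference):
    """Hypothetical stand-in for the trained classifier.

    `reference` maps a taxonomy (tuple of levels, top level first) to the
    set of k-mers seen for it. The taxonomy sharing the most k-mers with
    the bag wins; ties go to the alphabetically first taxonomy.
    """
    scores = {
        taxon: sum(1 for km in bag if km in ref_kmers)
        for taxon, ref_kmers in reference.items()
    }
    return max(sorted(scores), key=scores.get)


def classify_with_confidence(read, reference, threshold=0.7, seed=0):
    """Classify a read, then truncate the result using bootstrap confidence."""
    bag = kmers(read)
    if not bag:
        raise ValueError("read is shorter than the k-mer length")
    rng = random.Random(seed)
    full = toy_classify(bag, reference)

    # Assumption: each subsample draws len(bag) // K k-mers with replacement.
    sample_size = max(1, len(bag) // K)
    boots = [
        toy_classify(rng.choices(bag, k=sample_size), reference)
        for _ in range(N_BOOTSTRAPS)
    ]

    # Walk down from the top level; stop at the first level whose
    # agreement with the full-read classification is below the threshold.
    kept, confidence = (), None
    for depth in range(1, len(full) + 1):
        agree = sum(b[:depth] == full[:depth] for b in boots) / N_BOOTSTRAPS
        if agree < threshold:
            break
        kept, confidence = full[:depth], agree

    # kept is empty and confidence is None if even the top level fails.
    return kept, confidence
```

In this sketch, `threshold` plays the role of the confidence parameter: a read whose subsamples agree at the higher levels but scatter at the lowest level would be reported only down to the last level that cleared the threshold, along with the agreement fraction at that level.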

This feature is unstable because in a future release we will allow machine learning classifiers other than Naive Bayes to be used, and the bootstrap method does not generalise well to other classification methods. We have an alternative strategy that will be released at that time.

I hope that helps. Please let me know if you have any further questions.

@BenKaehler Thanks for the response! This answers my question.

2 off-topic replies have been split into a new topic: Does feature-classifier use the same kmer length for training and classification?

Please keep replies on-topic in the future.