P-confidence in "qiime feature-classifier classify-sklearn"

Hi everyone,
I am trying to better understand the meaning of the --p-confidence parameter in “qiime feature-classifier classify-sklearn”.
I read here (How is the "Confidence" calculated with taxa assignments?) that the confidence is related to how many times a subsample (of sequences) comes up with the same classification.
Is there any association between the confidence parameter and the percent identity (between our sequences and the database sequences)? Could you explain in more detail what the confidence means in “qiime feature-classifier classify-sklearn”? I saw that the default --p-confidence value is 0.7.

Thank you in advance for your help.
Best regards.

Charlotte

Hi Charlotte,
The confidence value can’t be directly translated into a percent identity, though in general a high confidence at a lower taxonomy level (e.g., species) is probably indicative of a high percent identity database match. However, it is also possible to have a high percent identity match but a low confidence, if the sequence that you matched is identical across several different taxonomic groups. For example, if your query sequence (the sequence you’re trying to classify) is ACCGGTT, and in the reference database you see the following sequences and associated taxonomies:

ACCGGTA : Genus 1
ACCGGTC : Genus 2
ACCGGTG : Genus 3

you will have a low-confidence genus-level assignment, even though your query sequence is almost a perfect match, because it’s almost a perfect match to three sequences that are in different genera. (In practice your sequences would of course be much longer than the ones I’m using in my example.)
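
To make the example above concrete, here is a minimal Python sketch. It is only a toy illustration and not the QIIME 2 or RDP implementation: it assumes a simple scheme where the query's 3-mers are resampled with replacement, each replicate votes for the reference genus sharing the most 3-mers (ties broken at random), and "confidence" is the fraction of replicates that agree with the top genus. The function names, the word size of 3, the number of replicates, and the second "distinct" reference set are all made up for illustration.

```python
import random
from collections import Counter


def kmers(seq, word_size=3):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + word_size] for i in range(len(seq) - word_size + 1)]


def percent_identity(a, b):
    """Percent of matching positions between two equal-length sequences."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def bootstrap_confidence(query, reference, n_boot=1000, word_size=3, seed=0):
    """Toy bootstrap: return (top genus, fraction of replicates choosing it).

    Assumes one reference sequence per genus (hypothetical simplification).
    """
    rng = random.Random(seed)
    query_kmers = kmers(query, word_size)
    ref_kmers = {genus: set(kmers(seq, word_size))
                 for seq, genus in reference.items()}
    votes = Counter()
    for _ in range(n_boot):
        # Resample the query's k-mers with replacement.
        sample = rng.choices(query_kmers, k=len(query_kmers))
        # Score each genus by how many sampled k-mers its reference contains.
        scores = {g: sum(km in ks for km in sample)
                  for g, ks in ref_kmers.items()}
        best = max(scores.values())
        tied = [g for g, s in scores.items() if s == best]
        # Break ties at random, so indistinguishable genera split the votes.
        votes[rng.choice(tied)] += 1
    genus, count = votes.most_common(1)[0]
    return genus, count / n_boot


query = "ACCGGTT"
# Reference set from the example: three near-identical sequences, three genera.
ambiguous = {"ACCGGTA": "Genus 1", "ACCGGTC": "Genus 2", "ACCGGTG": "Genus 3"}
# Made-up contrast: only Genus 1 resembles the query.
distinct = {"ACCGGTA": "Genus 1", "TTGAACA": "Genus 2", "GGATTCA": "Genus 3"}

for name, ref in [("ambiguous", ambiguous), ("distinct", distinct)]:
    best_identity = max(percent_identity(query, s) for s in ref)
    genus, conf = bootstrap_confidence(query, ref)
    print(f"{name}: best identity {best_identity:.1f}%, "
          f"top genus {genus}, confidence {conf:.2f}")
```

In both reference sets the best hit has the same percent identity (6 of 7 positions), but in the ambiguous set the votes split roughly evenly across the three genera, so the toy confidence lands well below a 0.7 threshold, while in the distinct set nearly every replicate agrees on Genus 1.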

You can find some more discussion of how confidence is computed in the original paper on the RDP classifier.

Hope this helps!

Thank you very much for your answer, @gregcaporaso!
Best regards.

Charlotte
