I have recently completed a project using QIIME 2. I chose to train a naive Bayes classifier and used feature-classifier classify-sklearn to assign taxonomy. I could use some help understanding how the machine learning confidence works. I am used to assigning reads based on a percent confidence. I did not set the confidence myself and just used the default, --p-confidence 0.7. I want to be prepared with a proper response when someone asks me what threshold I used to classify my taxonomy.
Should I be setting an additional threshold for taxonomy to keep at a certain level?
Any insight would be extremely helpful!
So the official documentation for the sklearn scripts in QIIME 2 points to this page:
I’m also curious to know the ‘microbiologist friendly’ explanation of this method. There is some discussion here, but I’m still looking for a concise defense of this method.
Yes, I do believe this method is more robust, but I expect to get questions about what I used as a % cutoff when classifying, and I’d like to have a better reason than "it was the default" haha!
Always wise to be prepared.
The default confidence setting was optimized for 16S/ITS data as described in this preprint. That preprint also describes how to adjust that score for different analysis scenarios, e.g., high-precision (very low false-positive) and high-recall (low false-negative) settings.
If you have 16S or ITS data, the default should be reasonable for you, and you can use the rationale provided in that preprint, that the default setting provides a balance between recall/precision for these data (both short reads and full-length 16S).
If you have 18S data, I suspect there would be very similar behavior to 16S since it is all SSU rRNA, but we did not explicitly test this (we do have an 18S mock community in mockrobiota if you want to test this yourself).
If you have a different marker gene, then you’re backed into a corner and have no choice but to stick to the default as your rationale. After all, there is probably no information out there on best practices for classifying your esoteric marker gene, so you may as well use what has worked elsewhere. The confidence settings are really characteristic of the classification method, not the marker gene used, so they should be fairly generalizable to other marker genes. (You can use that sentence verbatim, and your reviewer will probably not have evidence to the contrary.)
I hope that helps!
Thank you for the helpful explanation. This makes a lot more sense to me now.
As a follow-up question: using this new method, should I be combining OTUs that have the same species-level taxonomy? Do I need to set a certain confidence value to determine a species-level assignment, as we did with previous classifiers?
No, you should not combine them. Your OTUs are still distinct species variants and may be important for distinguishing samples. Just because they have the same taxonomic ID does not mean they should be collapsed, though this is done automatically when summarizing this information, e.g., in a barplot.
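To make the point about collapsing concrete, here is a toy sketch in plain Python. It is not QIIME 2 code; the function name `collapse_by_taxonomy`, the feature IDs, and the counts are all made up for illustration. It shows two samples that differ at the OTU level but look identical once OTUs sharing a taxonomy label are summed.

```python
from collections import defaultdict


def collapse_by_taxonomy(sample_counts, taxonomy):
    """Sum the counts of features that share a taxonomy label (one sample)."""
    collapsed = defaultdict(int)
    for feature_id, count in sample_counts.items():
        collapsed[taxonomy[feature_id]] += count
    return dict(collapsed)


# Hypothetical data: two OTUs that received the same species-level label.
taxonomy = {
    "otu1": "g__Lactobacillus;s__iners",
    "otu2": "g__Lactobacillus;s__iners",
}
sample_a = {"otu1": 10, "otu2": 0}
sample_b = {"otu1": 0, "otu2": 10}

# The samples differ at the OTU level...
print(sample_a != sample_b)
# ...but are indistinguishable after collapsing to the shared label.
print(collapse_by_taxonomy(sample_a, taxonomy) == collapse_by_taxonomy(sample_b, taxonomy))
```

Both lines print True, which is why collapsing is fine for a summary plot but can hide differences you may care about in downstream comparisons.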
No, you do not need a separate species-level cutoff. The same confidence parameter is used at each taxonomic level. The classifier provides the deepest level of taxonomic assignment that exceeds that confidence threshold. So if the classifier is more than 70% confident in the species-level classification, that sequence will be classified at species level. If there are no good matches at species or genus level, it will only report a family-level classification. The default parameter setting is reasonably cautious, so as to avoid false-positive errors (particularly at species level), but will often yield species-level classifications for 16S and ITS data.
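If it helps to see the "deepest level that passes the threshold" idea spelled out, here is a simplified sketch in plain Python. It is not the actual q2-feature-classifier implementation. The function name `assign_taxonomy`, the example probabilities, summing the probabilities of lineages that share a prefix, and treating the threshold as inclusive are all my assumptions for illustration.

```python
def assign_taxonomy(class_probs, confidence=0.7, separator=";"):
    """Truncate the top hit to the deepest rank whose support meets the threshold.

    class_probs maps full lineages to (hypothetical) classifier probabilities.
    Support for a rank is assumed to be the summed probability of every
    lineage sharing the top hit's prefix down to that rank.
    """
    best = max(class_probs, key=class_probs.get)
    best_ranks = best.split(separator)
    # Walk from the deepest rank (e.g. species) back toward the root.
    for depth in range(len(best_ranks), 0, -1):
        prefix = best_ranks[:depth]
        support = sum(
            p for lineage, p in class_probs.items()
            if lineage.split(separator)[:depth] == prefix
        )
        if support >= confidence:  # inclusive bound is an assumption
            return separator.join(prefix), support
    # No rank reached the threshold, not even the shallowest one.
    return "Unassigned", support


# Made-up probabilities for one read: two close species in the same genus.
probs = {
    "k__Bacteria;f__Lactobacillaceae;g__Lactobacillus;s__iners": 0.45,
    "k__Bacteria;f__Lactobacillaceae;g__Lactobacillus;s__crispatus": 0.40,
    "k__Bacteria;f__Streptococcaceae;g__Streptococcus;s__mitis": 0.15,
}

lineage, support = assign_taxonomy(probs, confidence=0.7)
print(lineage)  # k__Bacteria;f__Lactobacillaceae;g__Lactobacillus
```

In this toy case neither species reaches 0.7 on its own, but together they give the genus about 0.85, so the read is reported at genus level. Raising the threshold to 0.9 would push the same read back to k__Bacteria, which is the precision/recall trade-off in miniature.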
I hope that helps.