I have just started using QIIME 2. Recently, I used `qiime feature-classifier classify-sklearn` to classify public data from a study, and the taxonomic annotations I obtained were inconsistent with the study's results:
no Lactococcus and more Streptococcus (the study reported that Lactococcus was highly abundant in most samples, while Streptococcus accounted for only 7.51% ± 11.61%).
The processing tool used in that study was QIIME 1. One possible reason is that classify-sklearn misclassified Lactococcus as Streptococcus, so I tried adjusting the confidence level of classify-sklearn and also used `qiime feature-classifier classify-consensus-vsearch`, but neither worked well.
Therefore, I would like to understand the working principle of `qiime feature-classifier classify-sklearn`, or its specific data-processing steps (particularly those related to the confidence threshold). Can someone help me?
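For reference, the commands I ran were along these lines (file names are placeholders, and the exact flags and required outputs vary slightly across QIIME 2 releases):

```shell
# sklearn-based naive Bayes classification; --p-confidence is the
# minimum bootstrap confidence required to keep an assignment at
# each taxonomic rank (the default is 0.7).
qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs.qza \
  --p-confidence 0.7 \
  --o-classification taxonomy-sklearn.qza

# Alignment-based alternative: consensus taxonomy from vsearch hits.
qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs.qza \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classification taxonomy-vsearch.qza
```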
Or maybe it's a good thing: QIIME 2 should offer better taxonomic resolution than QIIME 1, so this may be a useful finding / correction to the original work.
For 16S rRNA gene sequences, naive Bayes bespoke classifiers with k-mer lengths between 12 and 32 and confidence = 0.5 yield maximal recall scores, but RDP (confidence = 0.5) and naive Bayes (uniform class weights, confidence = 0.5, k-mer length = 11, 12, or 18) also perform well.
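If you want to test whether the confidence threshold is driving the Lactococcus/Streptococcus difference, one simple sketch (artifact names are placeholders) is to re-classify the same representative sequences at a couple of thresholds and compare the resulting taxonomies:

```shell
# Re-classify the same reads at two confidence thresholds so the
# resulting taxonomy.qza artifacts can be compared side by side.
for conf in 0.5 0.7; do
  qiime feature-classifier classify-sklearn \
    --i-classifier classifier.qza \
    --i-reads rep-seqs.qza \
    --p-confidence ${conf} \
    --o-classification taxonomy-conf-${conf}.qza
done
```

A lower confidence keeps more species/genus-level assignments (higher recall, more misclassification risk); a higher confidence truncates uncertain assignments to a shallower rank.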
After reading the `_skl.py` training script posted on GitHub, I didn't find the code that sets the k-mer length, and I wonder if you can suggest a solution.
Your reply has been a great help in learning QIIME 2. I wish you a good day.
Absolutely. This is set during classifier training, with the `--p-feat-ext--ngram-range` option. The same k-mer length(s) will then be used during classification (i.e., the same pre-processing is applied to the query sequences).
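As a sketch (reference file names are placeholders), training with 12-mers would look something like this; the ngram range is passed as a string `'[min,max]'`, so `'[12,12]'` uses 12-mers only, while the default is `'[7,7]'`:

```shell
# Train a naive Bayes classifier on 12-mers instead of the default 7-mers.
# The k-mer (ngram) range is baked into the resulting classifier artifact,
# so classify-sklearn will automatically apply it to query sequences.
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --p-feat-ext--ngram-range '[12,12]' \
  --o-classifier classifier-12mer.qza
```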