I'm baffled by the low classification confidence values that qiime feature-classifier classify-sklearn assigns to ASVs with 100% BLAST ID and coverage.
I'm doing targeted amplicon sequencing of piroplasmids, so I've built a classifier with RESCRIPt from selected protist sequences drawn from the EukRibo and NCBI RefSeq databases, imported from NCBI by accession ID. I created a region-specific classifier using my primer sequences, dereplicated in uniq mode, then evaluated and fit the classifier on the same reference sequences (n = 308) with qiime rescript evaluate-fit-classifier. This reports 98.5% assignment at genus level and 91.7% at species level.
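In outline, the pipeline was roughly as follows (file names, primer variables, and output names are placeholders):

```bash
# Fetch the selected reference sequences and taxonomy from NCBI by accession ID
qiime rescript get-ncbi-data \
  --m-accession-ids-file accessions.tsv \
  --o-sequences ref-seqs.qza \
  --o-taxonomy ref-tax.qza

# Trim the references to the amplicon region using my primers
# ($FWD_PRIMER and $REV_PRIMER hold the actual primer sequences)
qiime feature-classifier extract-reads \
  --i-sequences ref-seqs.qza \
  --p-f-primer "$FWD_PRIMER" \
  --p-r-primer "$REV_PRIMER" \
  --o-reads ref-seqs-amplicon.qza

# Dereplicate in uniq mode
qiime rescript dereplicate \
  --i-sequences ref-seqs-amplicon.qza \
  --i-taxa ref-tax.qza \
  --p-mode uniq \
  --o-dereplicated-sequences ref-seqs-derep.qza \
  --o-dereplicated-taxa ref-tax-derep.qza

# Fit the classifier and evaluate it against the same reference set
qiime rescript evaluate-fit-classifier \
  --i-sequences ref-seqs-derep.qza \
  --i-taxonomy ref-tax-derep.qza \
  --o-classifier classifier.qza \
  --o-observed-taxonomy predicted-tax.qza \
  --o-evaluation classifier-evaluation.qzv
```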
However, when I use this classifier on sequence data I've generated, via qiime feature-classifier classify-sklearn with default parameters, some ASVs receive surprisingly low confidence values. For example, an ASV from the positive control has a confidence of 70%, yet when I enter this ASV into an NCBI BLAST search it shows 100% ID and coverage to the RefSeq sequence, which I've double-checked is included in my classifier. In general, confidence seems lower than I'd expect across the board, although some ASVs reach >99.9% confidence at species level. I can't work out what is going on, so I'd appreciate some guidance.
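The classification step itself was simply (same placeholder names; all parameters at their defaults, including the 0.7 confidence threshold):

```bash
qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza
```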
I've uploaded the classifier, the classifier evaluation, the input reads, and the output classifications in case they're helpful.
It is important to note that ML classifiers and local alignment operate in very different ways: confidence and % identity are calculated on entirely different principles and should not be conflated. 70% confidence is actually relatively high, considering that this metric sums to 100% across all possible hits.
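To make that concrete, here is a toy calculation (a simplification of what the naive Bayes classifier actually computes, with invented numbers, but it captures the behaviour). If $L_i$ is the likelihood of your ASV's k-mer profile under reference taxon $t_i$, the reported confidence is in effect a normalized posterior:

$$\mathrm{confidence}(t_i) = \frac{L_i}{\sum_j L_j}$$

So if a single near-neighbour in the reference scores a likelihood just 40% of the top taxon's, the top hit's confidence is already down to $1/(1 + 0.4) \approx 0.71$, even when that top hit is a 100% identity match.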
It sounds like your classifier is working quite well overall: both the evaluation results and the performance on real data are quite high. So based on your description it sounds like there is not a "problem" per se, just some clades that classify worse than others.
If confidence is relatively low, this means that there are other sequences in your reference database with relatively similar k-mer compositions, increasing the probability that your sequence could belong to one of those taxa and hence inevitably reducing the confidence of the primary hit (confidence is a relative metric, not an absolute one). This could be genuine, e.g., because there is another species with a fairly similar sequence (or k-mer profile, to be more precise), but it could also be due to noise in the database (e.g., misannotation, which is fairly common and occurs even in NCBI RefSeq; or polyphyletic clades, also very common and perhaps misannotation by another name). Some things to look at:
1. BLAST shows 100% ID and coverage vs. the expected species, but are there other hits with high % ID that belong to a different species? (See the first sketch after this list.)
2. You could look at the % ID among your reference sequences themselves. One "easy" way to do this is to use the dereplicate action in RESCRIPt to cluster at, say, 97% similarity in lca mode (do this after extracting your amplicon target, to simulate resolution for that region). This will show if/how some clades have reduced resolution due to sequence similarity in your domain of interest. You might even see some OTUs with rather shallow classifications; this would indicate some "pollution" in that cluster, most likely caused by a misannotated sequence. (See the second sketch below.)
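For the first check, one way to see the runners-up (hypothetical file names; assumes the BLAST+ command-line tools are installed) is to BLAST the ASV against your own reference set rather than the web database:

```bash
# Export the reference sequences from the QIIME 2 artifact
qiime tools export \
  --input-path ref-seqs-derep.qza \
  --output-path ref-export

# Build a local BLAST database and report the top 20 hits, not just the best
makeblastdb -in ref-export/dna-sequences.fasta -dbtype nucl -out piro-refs
blastn -query asv.fasta -db piro-refs \
  -outfmt "6 qseqid sseqid pident length qcovs" \
  -max_target_seqs 20
```

Any runner-up sitting at, say, 97 to 99% identity but annotated as a different species is exactly the kind of near-neighbour that drags confidence down.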
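And for the second check, something along these lines (again, placeholder file names):

```bash
# Cluster the amplicon-trimmed references at 97% similarity, collapsing each
# cluster's taxonomy to the lowest common ancestor (lca) of its members
qiime rescript dereplicate \
  --i-sequences ref-seqs-amplicon.qza \
  --i-taxa ref-tax.qza \
  --p-mode lca \
  --p-perc-identity 0.97 \
  --o-dereplicated-sequences ref-seqs-97-lca.qza \
  --o-dereplicated-taxa ref-tax-97-lca.qza
```

Clusters whose LCA taxonomy stops at a shallow rank (e.g., family) are the ones to inspect for misannotated members.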
Just some food for thought re: confidence and naive Bayes classifiers. I recall reading in one of the old RDP classifier papers (the RDP classifier is another naive Bayes classifier) that they recommended a 50% confidence threshold for short 16S rRNA gene reads. You can have 100% ID but 50% confidence... it all depends on what other hits occur (and this is one reason why % ID can likewise be misleading or even meaningless).
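Incidentally, if you want to experiment with that, the threshold is exposed as --p-confidence in classify-sklearn (the default is 0.7); e.g., to mirror that RDP-style 50% cutoff:

```bash
qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs.qza \
  --p-confidence 0.5 \
  --o-classification taxonomy-conf50.qza
```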
You could use the evaluate-taxonomy action in RESCRIPt to look at the depth of classification for all of your ASVs... this might be a useful way to summarize how well your classifier performed on real data.
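Something like (placeholder names):

```bash
qiime rescript evaluate-taxonomy \
  --i-taxonomies taxonomy.qza \
  --o-taxonomy-stats taxonomy-depth-eval.qzv
```

The resulting visualization summarizes classification depth across your ASVs.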