I'm trying to understand the output (taxonomy.qza) of the qiime feature-classifier classify-sklearn command. Attached to this post is the visualization version of the output (taxonomy.qzv).
Basically, I'm having a hard time understanding why some of the ASVs are classified as "Unassigned" when they have a confidence value that's greater than the default threshold (c=0.7). For example, in the taxonomy.qzv file, a feature (or ASV) called bd3cc9947a6f5180917fd5200b25e79b is assigned as "Unassigned" even though it has a confidence value of 0.773254852.
However, according to this beautifully well explained reply by @BenKaehler, shouldn't it have been assigned an actual taxon since the bootstrapping process ("the bag of 8-mers 100 times") produced the same classification as the full read did roughly 77/100 times, which is greater than the threshold (70/100 times)? This reply by @Nicholas_Bokulich also seems to support my idea:
"So if the classifier is > 70% confident in the species-level classification, that sequence will be classified at species level."
BTW, I can 100% understand why the ASVs with a confidence value less than 0.7 are classified as "Unassigned".
I feel like there's got to be a simple explanation for this and I'm just missing a piece. If you could enlighten me as to why the above makes sense (or does not), I'd greatly appreciate it. Thank you.
P.S. Please note that I'm using the qiime2-2020.8 version. For taxonomy assignment, I used a pre-trained Naive Bayes classifier (silva-138-99-nb-classifier.qza). I'm aware that training a classifier based on my own dataset can result in better classification, and that's on my to-do list. But I don't think my issue is related to this.
To answer your question: if it is not possible to classify at the highest level (usually kingdom) at the required confidence level, the classifier outputs "Unassigned" for the classification, and for the confidence it returns one minus the confidence that it had in the original kingdom-level classification. So, if you like, it is the confidence that the sequence does not belong to the original kingdom-level classification (that is, its confidence that it should be "Unassigned").
What do I mean by the original kingdom-level classification? The original classification is the one that the classifier would have chosen if it were asked to classify all the way down to species level, regardless of confidence.
Now for the update. We now calculate confidence values by summing the probability-like outputs of the scikit-learn classifier (usually the multinomial naive Bayes classifier). If the probability-like output for the original classification (down to species level) is not greater than the confidence parameter, the classifier sums all of the probability-like outputs for the classifications that match the original classification down to genus level and compares again. This procedure is repeated, moving up one taxonomic level each time, until the sum exceeds the confidence parameter or we end up with "Unassigned".
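For anyone who wants to see the procedure above spelled out, here is a minimal Python sketch. It is not the q2-feature-classifier code: the function name classify_with_confidence, the semicolon separator, and the toy taxonomy strings are all assumptions made for illustration, and the strict "greater than" comparison simply follows the wording above.

```python
def classify_with_confidence(class_probs, confidence=0.7, sep=";"):
    """Illustrative only. class_probs maps full taxonomy strings to
    probability-like outputs that sum to one."""
    # The "original classification": the most probable full-depth label.
    best = max(class_probs, key=class_probs.get)
    levels = best.split(sep)

    # Try the full label first, then truncate one level at a time.
    for depth in range(len(levels), 0, -1):
        prefix = levels[:depth]
        # Sum the outputs of every class that agrees with the original
        # classification down to this depth.
        total = sum(p for label, p in class_probs.items()
                    if label.split(sep)[:depth] == prefix)
        if total > confidence:
            return sep.join(prefix), total

    # Even the top level failed: report one minus the top-level confidence.
    return "Unassigned", 1.0 - total


# Hypothetical outputs for one sequence (made-up numbers).
probs = {
    "k__Bacteria;p__Firmicutes;g__A": 0.5,
    "k__Bacteria;p__Firmicutes;g__B": 0.25,
    "k__Bacteria;p__Proteobacteria;g__C": 0.125,
    "k__Archaea;p__X;g__D": 0.125,
}
print(classify_with_confidence(probs))  # ('k__Bacteria;p__Firmicutes', 0.75)

# A sequence the classifier cannot even place in a kingdom at 0.7.
split = {"k__Bacteria;p__Y;g__E": 0.625, "k__Archaea;p__X;g__D": 0.375}
print(classify_with_confidence(split))  # ('Unassigned', 0.375)
```

In the second case the top-level confidence is 0.625, so the value reported alongside "Unassigned" is 1 - 0.625 = 0.375.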
Ok, but what is a probability-like output? Many scikit-learn classifiers output probability-like outputs for each class that they could predict (via the predict_proba method). I'm calling them probability-like because they add up to one and each is between zero and one, but they are not usually probabilities in the sense that they attempt to accurately model the probability of an event.
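To make "probability-like" concrete, here is a small pure-Python sketch of how per-class log scores can be turned into numbers that lie between zero and one and add up to one. The function name normalise_log_scores and the scores themselves are made up for illustration; this is only meant to show the normalisation idea, not to reproduce what any particular scikit-learn classifier does internally.

```python
import math


def normalise_log_scores(log_scores):
    """Turn arbitrary per-class log scores into values in [0, 1] that sum to one."""
    # Subtract the maximum first so the exponentials cannot overflow.
    biggest = max(log_scores)
    exps = [math.exp(s - biggest) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical log scores for three classes.
scores = [-120.0, -121.0, -125.0]
outputs = normalise_log_scores(scores)
print(outputs)
print(sum(outputs))
```

The results behave like probabilities arithmetically, but nothing in the normalisation guarantees that they are well calibrated.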
Why the change? It is faster (by about 100 times) and it also works better with a wider range of scikit-learn classifiers. We benchmarked extensively to ensure there was no loss in classification performance.
For the curious, the code is here. I know, it works in the opposite order to how I explained it but it was easier to explain it that way and it does the same thing. I also know that there are faster ways to do this. They are on our to do list.
I hope that helps, please follow up with any resulting confusion.
Hi @BenKaehler, I deeply appreciate you taking time to give me this thoughtful response! Can I ask some follow-up questions?
[Q1] By a "probability-like output", do you mean likelihood then? I'm just curious in case I have to explain this to someone (I looked at the predict_proba method, but I still couldn't figure it out).
[Q2] Could you please elaborate on what you mean by "... at the highest level (usually kingdom) ..."? Shouldn't it always be at the kingdom level? Do you mean sometimes the classifier would output "Unassigned" without ever trying the kingdom-level classification (e.g. output "Unassigned" at the phylum level)?
[Q3] If I understood your statement above correctly, the ASV bd3cc9947a6f5180917fd5200b25e79b in my original post had a confidence value 0.226745148 = 1 - 0.773254852 for the original kingdom-level classification. Is that correct? And the taxonomy.qzv file is simply displaying the confidence value that the ASV should be "Unassigned" (i.e. 0.773254852 instead of 0.226745148). However, I'm still not clear why we couldn't just report 0.226745148 like many other ASVs that are classified as "Unassigned" and have low confidence (c < 0.7). For example, an ASV called c6fb3d68528584c67a521df928644474 in the file has a confidence value of 0.30000612182976194. Should I interpret this as the confidence it should be "Unassigned" or as the confidence it is the original kingdom-level classification? I hope you can see my confusion here.
Thank you so much for engaging in this conversation with me. [Q3] is my major concern, so I would greatly appreciate it if you could at least answer that question.
No, I don't mean likelihood, whatever that is. I just mean that the output for each class is between zero and one and that the outputs sum to one over all the classes. Machine learning classifiers frequently produce such outputs, but seldom make any guarantees about what the outputs mean. For instance, for the naive Bayes classifier, the outputs are produced using formulae that look like posterior probability calculations, but they're not, because probabilistically speaking the assumptions made in the calculations are very wrong (hence "naive"). Like all machine learning approaches, though, we use it because it works.
I said "usually" because you don't have to use Greengenes, or SILVA, or a database that cares about kingdoms. (For instance, you could build one that uses domain instead of kingdom.) So it refers to possible complications regarding the database you use, not the classifier itself.
That is correct.
In that instance (c6fb3d68528584c67a521df928644474), it means that the classifier's confidence in the original kingdom was 0.69999387817, or not quite 0.7. On the one hand it's unfortunate that it just missed out, but on the other hand this is one of those questions where if you're asking it, the answer doesn't really matter. That is, if you're not quite sure whether it's even a bacterium, "Unassigned" is probably the right answer.
So the classifier is ~0.3 confident that it is "Unassigned" or ~0.7 confident that it is a bacterium. It is worth noting that what we're calling "confidence" here is just a mechanism for choosing when to abstain from classification. It's not related to any formal notion of confidence (like in "confidence intervals"), it just gets called that because it's what the RDP Classifier's creators chose to call it, AFAIK.
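As a quick check of the arithmetic for the two ASVs discussed in this thread, here is a tiny Python sketch. The helper name implied_top_level_confidence is hypothetical, and it only applies to rows whose taxon is "Unassigned", where the reported value is one minus the confidence in the original top-level classification.

```python
def implied_top_level_confidence(reported_confidence):
    """For an 'Unassigned' row, recover the confidence the classifier had
    in the original top-level (usually kingdom) classification."""
    return 1.0 - reported_confidence


# Reported confidences for the two "Unassigned" ASVs mentioned above.
reported = {
    "bd3cc9947a6f5180917fd5200b25e79b": 0.773254852,
    "c6fb3d68528584c67a521df928644474": 0.30000612182976194,
}

for asv, conf in reported.items():
    top = implied_top_level_confidence(conf)
    # Both values fall below the 0.7 threshold, hence "Unassigned".
    print(asv, round(top, 9), top < 0.7)
```

So a high reported confidence on an "Unassigned" row means the classifier was far from the threshold, while a value near 0.3 means it only just missed it.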