Question about ASVs that are taxonomically classified as "Unassigned" despite having a high confidence value (e.g. >0.7)

BenKaehler · September 24, 2020, 12:02am

Hi @sbslee, thanks for your question.

An update to my original post is well overdue.

To answer your question: if the classifier cannot classify at the highest level (usually kingdom) at the required confidence level, it outputs "Unassigned" as the classification. For the confidence, it returns one minus the confidence that it had in the original kingdom-level classification. In other words, the reported value is the classifier's confidence that the original kingdom-level classification is wrong, which is its confidence that the result should be "Unassigned".

What do I mean by the original kingdom-level classification? The original classification is the one that the classifier would have chosen if it were asked to classify all the way down to species level, regardless of confidence.

Now for the update. We now calculate confidence values by summing the probability-like outputs of the scikit-learn classifier (usually the multinomial naive Bayes classifier). If the probability-like output for the original classification (down to species level) is not greater than the confidence parameter, the classifier sums all of the probability-like outputs for the classifications that match the original classification down to genus level and compares again. This procedure is repeated, moving up one taxonomic level each time, until the sum exceeds the confidence parameter or we end up with "Unassigned".
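To make that concrete, here is a minimal sketch of the procedure in the order I described it. It is not the plugin's actual code: the function name `classify_with_confidence`, the dictionary input, the `;` separator, and all of the numbers are made up for illustration.

```python
def classify_with_confidence(class_probs, confidence=0.7, sep=";"):
    """Illustrative sketch only (hypothetical helper, not the plugin's code).

    class_probs maps a full taxonomy string to its probability-like output.
    Returns (classification, confidence).
    """
    # The "original classification": the most probable class, all the way
    # down to the deepest level, regardless of confidence.
    best = max(class_probs, key=class_probs.get)
    levels = best.split(sep)

    total = 0.0
    # Start at the deepest level and move up one level at a time.
    for depth in range(len(levels), 0, -1):
        prefix = levels[:depth]
        # Sum the outputs of every class that matches the original
        # classification down to this level.
        total = sum(p for taxon, p in class_probs.items()
                    if taxon.split(sep)[:depth] == prefix)
        # "Exceeds" is taken literally here as a strict comparison.
        if total > confidence:
            return sep.join(prefix), total

    # Even the highest level failed, so `total` is now the confidence in
    # the original highest-level classification.
    return "Unassigned", 1.0 - total


# Made-up outputs for one sequence: two species in the same genus split
# most of the weight between them.
probs = {
    "k__Bacteria;g__A;s__a1": 0.40,
    "k__Bacteria;g__A;s__a2": 0.35,
    "k__Bacteria;g__B;s__b1": 0.05,
    "k__Archaea;g__C;s__c1": 0.20,
}
print(classify_with_confidence(probs, confidence=0.7))

# Made-up outputs spread almost evenly over four kingdoms.
spread = {
    "k__W;g__w;s__w1": 0.28,
    "k__X;g__x;s__x1": 0.26,
    "k__Y;g__y;s__y1": 0.24,
    "k__Z;g__z;s__z1": 0.22,
}
print(classify_with_confidence(spread, confidence=0.7))
```

With these invented numbers, the first call should stop at genus level with a confidence of about 0.75. The second should return "Unassigned" with a confidence of about 0.72, which is the situation in the question: the best kingdom only reaches 0.28, so one minus that is above 0.7.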

OK, but what is a probability-like output? Many scikit-learn classifiers return one for each class that they could predict (via the predict_proba method). I'm calling them probability-like because they add up to one and each is between zero and one, but they are not usually probabilities in the sense that they attempt to accurately model the probability of an event.
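If you want to see those two properties for yourself, here is a small sketch using scikit-learn's `MultinomialNB` directly. The count matrix and the labels `taxon_a`, `taxon_b`, and `taxon_c` are toy values I made up; they are not real k-mer counts or real taxa.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count features (rows are training examples) with invented labels.
X = np.array([
    [3, 0, 1],
    [2, 1, 0],
    [0, 4, 1],
    [0, 3, 2],
    [1, 0, 5],
    [0, 1, 4],
])
y = np.array(["taxon_a", "taxon_a", "taxon_b",
              "taxon_b", "taxon_c", "taxon_c"])

clf = MultinomialNB().fit(X, y)

# One query with ambiguous counts; predict_proba returns one value per class.
proba = clf.predict_proba(np.array([[1, 1, 1]]))

print(clf.classes_)       # the classes, in the column order of `proba`
print(proba)              # each value lies between zero and one
print(proba.sum(axis=1))  # each row sums to one (up to rounding)
```

The values behave like probabilities in those two respects, but nothing here checks that they are well calibrated, which is why I call them probability-like.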

Why the change? It is faster (by about 100 times) and it also works better with a wider range of scikit-learn classifiers. We benchmarked extensively to ensure there was no loss in classification performance.

For the curious, the code is here. I know, it works in the opposite order to how I explained it, but it was easier to explain it that way and it does the same thing. I also know that there are faster ways to do this. They are on our to-do list.

I hope that helps. Please follow up if anything is still confusing.