classify samples baseline accuracy clarification

Hoping for some clarification on what “baseline accuracy” indicates in the output from the classify-samples program (specifically the .qzv file that is created when passing the --o-accuracy-results argument). I’m trying to follow the q2-sample-classifier preprint:

  • the term overall accuracy reflects the “percentage of test samples that were accurately classified”

Just to confirm, this indicates that the subset of samples being classified end up in the right (expected) group, correct?

What I’m confused about is what the baseline accuracy is representing. From the preprint, this term indicates: “classification accuracy if all samples were classified to the most abundant class”

In the context of the paper there is an indication that some datasets (like HMP and EMP) have more classes; am I correct in thinking that class in this context indicates the groups that samples are associated with (i.e., a body site)? If that’s the case, it would be great to have a simple example of how this baseline accuracy would be calculated.

Thank you!

correct

correct

So imagine you have 100 test samples and 10 classes (1 through 10). 50 samples belong to class 1, with the remaining samples split among the other 9 classes. The baseline accuracy is then 50%: class 1 is the most abundant class, so assigning every sample to class 1 would get 50 of the 100 samples right. This is a crude but efficient way to assess classification accuracy. You can imagine other ways to calculate a "baseline accuracy", e.g., just assume a random guess (in which case baseline accuracy is 10% in our example), but that is even more crude and probably less useful.
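To make the arithmetic concrete, here is a minimal sketch of both baselines for the 100-sample example. This is only an illustration and not the q2-sample-classifier implementation; the function names `baseline_accuracy` and `random_guess_accuracy` are made up for this post, and the way the 50 remaining samples are spread over classes 2 to 10 is an arbitrary assumption.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy obtained by assigning every sample to the most abundant class."""
    counts = Counter(labels)
    # most_common(1) returns [(label, count)] for the largest class
    _, largest_class_size = counts.most_common(1)[0]
    return largest_class_size / len(labels)


def random_guess_accuracy(labels):
    """Expected accuracy of a uniform random guess over the observed classes."""
    return 1 / len(set(labels))


# 100 test samples: 50 in class 1, the other 50 spread over classes 2-10
# (the exact split among classes 2-10 is an arbitrary choice for illustration)
true_labels = [1] * 50 + [2 + (i % 9) for i in range(50)]

print(baseline_accuracy(true_labels))      # 0.5
print(random_guess_accuracy(true_labels))  # 0.1
```

The overall accuracy reported for a classifier is only informative relative to a number like this: a model that scores 55% on these samples is barely beating the "always guess class 1" strategy.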


Thanks @Nicholas_Bokulich. Good to know that “baseline” is a moveable target.
