classify samples baseline accuracy clarification

devonorourke · April 23, 2019, 3:52pm

Hoping for some clarification on what "baseline accuracy" indicates in the output from the classify-samples program (specifically the .qzv file that is created when passing the --o-accuracy-results argument). I'm trying to follow the q2-sample-classifier preprint:

the term overall accuracy reflects the "percentage of test samples that were accurately classified"

Just to confirm, this is the fraction of test samples that end up in the right (expected) group, correct?

What I'm confused about is what the baseline accuracy is representing. From the preprint, this term indicates: "classification accuracy if all samples were classified to the most abundant class"

The paper indicates that some datasets (like HMP and EMP) have more classes; am I correct in thinking that "class" in this context means the group that a sample is associated with (i.e., a body site)? If that's the case, it would be great to have a simple example of how this baseline accuracy would be calculated.

Thank you!

Nicholas_Bokulich · April 23, 2019, 4:19pm

correct

So imagine you have 100 test samples and 10 classes (1 through 10). 50 samples belong to class 1, with the remaining 50 split among the other 9 classes. The baseline accuracy is then 50%: class 1 is the most abundant class, so assigning every sample to it would classify 50 of the 100 samples correctly. This is a crude but efficient reference point for assessing classification accuracy. You can imagine other ways to calculate a "baseline accuracy", e.g., just assume a uniform random guess (in which case the expected baseline accuracy is 10% in our example), but that is even more crude and probably less useful.
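
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the example above. It is only an illustration and not necessarily how q2-sample-classifier computes the value internally. The function names and the exact split of the remaining 50 samples across classes 2 through 10 are assumptions made up for this example.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy obtained by assigning every sample to the most abundant class."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


def random_guess_accuracy(labels):
    """Expected accuracy of a uniform random guess over the observed classes."""
    return 1 / len(set(labels))


# Hypothetical test set: 50 samples in class 1, the other 50 spread over
# classes 2-10 (this particular split is an assumption for illustration).
other_counts = {2: 6, 3: 6, 4: 6, 5: 6, 6: 6, 7: 5, 8: 5, 9: 5, 10: 5}
test_labels = [1] * 50
for cls, n in other_counts.items():
    test_labels += [cls] * n

print(baseline_accuracy(test_labels))      # 0.5
print(random_guess_accuracy(test_labels))  # 0.1
```

A classifier's overall accuracy is only informative to the extent that it exceeds the first number, since a model that always predicts class 1 would already reach 50% on this test set.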

devonorourke · April 23, 2019, 4:51pm

Thanks @Nicholas_Bokulich. Good to know that "baseline" is a moveable target.