q2-sample-classifier minimum samples

jrw187 · November 4, 2020, 12:05am

Hello,

A while back I ran q2-sample-classifier on a dataset that I was working on. I got really good model accuracy for the categorical variable I was studying and was pretty happy with this (and then ran the regression for some numerical variables and was even happier). But in an unrelated search, I came across a reply to a post in the forum where someone was trying to run this command on a dataset with only 6 samples, and the response was that you shouldn't be using this method if you don't have at least ~50 samples.

As my dataset has 40 samples, I was a bit concerned, so I went and looked at the estimator selection flowchart in the scikit-learn documentation. Again, 50 seems to be an important threshold there.

How much of a problem is it that I am shy of that threshold? Would the results that I got from running the random forest models with this plug-in still be reliable enough or would you recommend shelving the results I got from this and focusing on other analyses?

Thanks in advance for your thoughts on this.

Nicholas_Bokulich · November 4, 2020, 6:45am

Hi @jrw187,
50 is also low, but this is really just a "rule of thumb"... more would always be merrier, but it also depends slightly on complexity (e.g., how many classes you are predicting), and your goals (e.g., to find interesting patterns or to develop a diagnostic test!).

With 40 samples you are on the low side... I'd say you could report the results as promising early results but acknowledge that testing additional samples (from additional studies/collections) would be needed to confirm the robustness.

Alternatively, track down some samples from related studies and see how well your classifier predicts those samples (note that study covariates, including processing effects, can muddy the waters here).

Another possibility (this does not really "fix" the problem but can help assess how wide a margin of error you have) is to run the classify-samples-ncv and regress-samples-ncv methods instead (see the tutorial for details)... this will train/classify all samples via cross-validation and report the accuracy ± standard deviation (directly in the terminal if using --verbose mode).
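To give a rough feel for what "accuracy ± standard deviation" from cross-validation looks like with about 40 samples, here is a minimal scikit-learn sketch. It is only an illustration, not the plugin's implementation: it uses plain stratified 5-fold cross-validation rather than the nested procedure, and the data are synthetic (the feature count, number of trees, and fold count are all assumptions picked for the example).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a 40-sample feature table with two classes.
# (Hypothetical data; swap in your own table and metadata column.)
X, y = make_classification(
    n_samples=40, n_features=50, n_informative=5, n_classes=2, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Each sample lands in a test fold exactly once, so every sample gets
# predicted by a model that never saw it during training.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

# The spread across folds hints at how wide the margin of error is.
print(f"accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
```

With only 8 samples per test fold, a single misclassification moves that fold's accuracy by 12.5 percentage points, so a large standard deviation is to be expected at this sample size.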

Good luck!
