How to select a metadata column for q2-sample-classifier

Hi,
I was trying to run the following command with my dataset:
qiime sample-classifier classify-samples \
  --i-table moving-pictures-table.qza \
  --m-metadata-file moving-pictures-sample-metadata.tsv \
  --m-metadata-column body-site \
  --p-optimize-feature-selection \
  --p-parameter-tuning \
  --p-estimator RandomForestClassifier \
  --p-n-estimators 20 \
  --p-random-state 123 \
  --output-dir moving-pictures-classifier
But I got the following error:
Plugin error from sample-classifier:

You have chosen to predict a metadata column that contains one or more values that match only one sample. For proper stratification of data into training and test sets, each class (value) must contain at least two samples. This is a requirement for classification problems, but stratification can be disabled for regression by setting stratify=False. Alternatively, remove all samples that bear a unique class label for your chosen metadata column. Note that disabling stratification can negatively impact predictive accuracy for small data sets.

Debug info has been saved to /tmp/qiime2-q2cli-err-5tchsqkm.log

So how can I fix this problem? There is no stratify=False option in the above command, and the error message suggests that disabling stratification is not recommended anyway. The metadata column I am selecting is categorical (non-numeric). I followed the tutorial, and I do not think there is any error in the sample metadata file.
Please advise.
Thanks and Regards

Hi @sanda,
It sounds like you are attempting to classify a dataset that is too small to provide useful results. I realize you are probably just testing this method on the moving pictures data before proceeding with other data, so just be aware that you should have a much larger dataset (see the tutorial for more details).

See the error message:
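The error message suggests removing all samples that bear a unique class label for the chosen metadata column. A minimal sketch of that filtering step follows, using only the Python standard library rather than any QIIME 2 API. The function name `drop_rare_classes`, the tiny example metadata, and the treatment of rows starting with `#q2:` as directive rows are assumptions made for illustration, not something taken from this thread.

```python
import csv
import io
from collections import Counter


def drop_rare_classes(tsv_text, column, min_per_class=2):
    """Return (filtered_tsv_text, rare_counts) for a tab-separated metadata table.

    Hypothetical helper: samples whose value in `column` occurs fewer than
    `min_per_class` times are removed from the returned text.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    col = header.index(column)

    # Assumption: rows whose first cell starts with "#q2:" are directive
    # rows (for example a column-types row), not samples.
    def is_sample(row):
        return bool(row) and not row[0].startswith("#q2:")

    # Count how many samples carry each class label.
    counts = Counter(row[col] for row in body if is_sample(row))
    rare = {value: n for value, n in counts.items() if n < min_per_class}

    # Keep the header, non-sample rows, and samples whose class is not rare.
    kept = [header] + [
        row for row in body if not is_sample(row) or row[col] not in rare
    ]
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(kept)
    return out.getvalue(), rare


# Made-up example metadata: "tongue" appears on only one sample.
example = (
    "sample-id\tbody-site\n"
    "#q2:types\tcategorical\n"
    "s1\tgut\n"
    "s2\tgut\n"
    "s3\ttongue\n"
)
filtered, dropped = drop_rare_classes(example, "body-site")
print(dropped)
```

The filtered text could be saved as a new metadata file and passed to the classifier in place of the original. The feature table may also need to be restricted to the same samples, for example with `qiime feature-table filter-samples`.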

Good luck!

Hi @Nicholas_Bokulich,
Thanks for your reply.
Yes, you are right. I was trying to test this method because I found it interesting and relevant. I read the tutorial. As mentioned there, the minimum requirement is 50 samples. In my case I have around 60, but after merging my data (which I will do in the future) there will be more than 100.
My only concern is that the tutorial says “categorical metadata columns that are used as classifier targets should have a minimum of 10 samples per unique value…”, and this confuses me. When I look at the tutorial sample data, where “body site” is used as the metadata column, I do not find 10 samples per unique value. So I would like to know more about choosing a metadata column. I may not have understood “10 samples per unique value” correctly. Can you suggest some key points here?

Thanks and best regards,

The tutorial dataset is a minimal dataset that is used there for a quick example, not representative of a full-scale analysis.
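
To make “10 samples per unique value” concrete: each distinct value in the target column (each body site, for instance) should appear on at least 10 samples. The sketch below checks a list of labels against the two guideline numbers quoted in this thread (50 samples in total, 10 per unique value). The function name `check_class_sizes` and the example labels are hypothetical, and the thresholds are guidelines rather than hard limits enforced by the plugin.

```python
from collections import Counter


def check_class_sizes(labels, min_per_class=10, min_total=50):
    """Summarize how a list of class labels compares with the guideline sizes."""
    counts = Counter(labels)
    return {
        "total": len(labels),
        "enough_total": len(labels) >= min_total,
        "per_class": dict(counts),
        # Unique values that have fewer samples than the per-class guideline.
        "too_small": sorted(v for v, n in counts.items() if n < min_per_class),
    }


# Made-up labels: 12 gut, 9 tongue, and 1 palm sample.
labels = ["gut"] * 12 + ["tongue"] * 9 + ["palm"] * 1
report = check_class_sizes(labels)
print(report["too_small"])
```

With these made-up labels, only "gut" meets the per-value guideline, and the single "palm" sample is the kind of value that triggers the stratification error quoted above.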
