Stratify by non-numeric metadata category for sample-classifier regress-samples

vrbana · October 25, 2018, 7:39pm

Hello, would it be possible to add an option to stratify by a metadata category other than the numeric one being predicted by the regression? I get the error message:

“You have chosen to predict a metadata column that contains one or more values that match only one sample. For proper stratification of data into training and test sets, each class (value) must contain at least two samples. This is a requirement for classification problems, but stratification can be disabled for regression by setting stratify=False. Alternatively, remove all samples that bear a unique class label for your chosen metadata column. Note that disabling stratification can negatively impact predictive accuracy for small data sets.”

I have a numeric metadata category that I would like to predict. However, I would also like to stratify the training and test sets by a non-numeric category (e.g., treatment group).

Nicholas_Bokulich · October 25, 2018, 8:23pm

Hi @vrbana,

It is not possible to stratify by two columns. It looks like this is not possible in scikit-learn either, and scikit-learn does the splitting, so supporting this would not be a trivial modification.

You can do a custom split on your feature data, e.g., with qiime feature-table filter-samples, but it would be complicated, so I don't have guidelines. This would effectively bypass the splitting step of the pipeline, and you would need to run qiime sample-classifier fit-regressor, predict-regression, and scatterplot manually instead of using regress-samples. Similarly, if you want to stratify on a different column from your target column, you can run split-table followed by those other steps to bypass the normal pipeline structure (which assumes stratification on the target column).
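
To give a rough idea of what a custom split could look like, here is a minimal sketch in plain Python (standard library only) that divides sample IDs into training and test sets so that each category is represented in both. It is not how regress-samples or split-table work internally. The function name stratified_split, the treatment dictionary, and the sample IDs are all hypothetical and only there for illustration.

```python
import random
from collections import defaultdict


def stratified_split(groups, test_size=0.2, seed=0):
    """Split sample IDs into train/test sets, stratified on a category.

    groups: dict mapping sample ID -> category label (e.g. treatment group).
    Returns (train_ids, test_ids) as sorted lists.
    """
    # Collect the sample IDs that belong to each category.
    by_label = defaultdict(list)
    for sample_id, label in groups.items():
        by_label[label].append(sample_id)

    rng = random.Random(seed)
    train_ids, test_ids = [], []
    for label, ids in by_label.items():
        if len(ids) < 2:
            raise ValueError(f"category {label!r} has fewer than two samples")
        ids = sorted(ids)  # fixed starting order so the seed is reproducible
        rng.shuffle(ids)
        # Keep at least one sample per category on each side of the split.
        n_test = min(max(1, round(len(ids) * test_size)), len(ids) - 1)
        test_ids.extend(ids[:n_test])
        train_ids.extend(ids[n_test:])
    return sorted(train_ids), sorted(test_ids)


# Hypothetical metadata: sample ID -> treatment group (10 samples per group).
treatment = {f"s{i}": ("control" if i % 2 else "treated") for i in range(20)}
train_ids, test_ids = stratified_split(treatment, test_size=0.2, seed=42)
```

The two ID lists could then be saved and used to filter the feature table into training and test tables (e.g., with qiime feature-table filter-samples) before running the fit and predict steps manually.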

I hope that helps!