Sample-classifier > regress-samples forcing usage of stratification

sbslm · September 26, 2018, 7:04pm

QIIME 2 v2018.8 seems to force stratification of the data in regress-samples even when the --p-no-stratify flag is used.
Here's the traceback:

Traceback (most recent call last):
  File "/opt/conda/envs/qiime2-2018.8/lib/python3.5/site-packages/q2cli/commands.py", line 274, in __call__
    results = action(**arguments)
  File "", line 2, in regress_samples
  File "/opt/conda/envs/qiime2-2018.8/lib/python3.5/site-packages/qiime2/sdk/action.py", line 231, in bound_callable
    output_types, provenance)
  File "/opt/conda/envs/qiime2-2018.8/lib/python3.5/site-packages/qiime2/sdk/action.py", line 455, in callable_executor
    outputs = self._callable(scope.ctx, **view_args)
  File "/opt/conda/envs/qiime2-2018.8/lib/python3.5/site-packages/q2_sample_classifier/classify.py", line 140, in regress_samples
    stratify, missing_samples=missing_samples)
  File "", line 2, in split_table
  File "/opt/conda/envs/qiime2-2018.8/lib/python3.5/site-packages/qiime2/sdk/action.py", line 231, in bound_callable
    output_types, provenance)
  File "/opt/conda/envs/qiime2-2018.8/lib/python3.5/site-packages/qiime2/sdk/action.py", line 362, in callable_executor
    output_views = self._callable(**view_args)
  File "/opt/conda/envs/qiime2-2018.8/lib/python3.5/site-packages/q2_sample_classifier/classify.py", line 228, in split_table
    stratify=True, missing_samples=missing_samples)
  File "/opt/conda/envs/qiime2-2018.8/lib/python3.5/site-packages/q2_sample_classifier/utilities.py", line 390, in _prepare_training_data
    features, targets, column, test_size, strata, random_state)
  File "/opt/conda/envs/qiime2-2018.8/lib/python3.5/site-packages/q2_sample_classifier/utilities.py", line 169, in _split_training_data
    _stratification_error()
  File "/opt/conda/envs/qiime2-2018.8/lib/python3.5/site-packages/q2_sample_classifier/utilities.py", line 189, in _stratification_error
    'You have chosen to predict a metadata column that contains '
ValueError: You have chosen to predict a metadata column that contains one or more values that match only one sample. For proper stratification of data into training and test sets, each class (value) must contain at least two samples. This is a requirement for classification problems, but stratification can be disabled for regression by setting stratify=False. Alternatively, remove all samples that bear a unique class label for your chosen metadata column. Note that disabling stratification can negatively impact predictive accuracy for small data sets.

##########################

I have tried the same command under v2018.6 and it works fine, so the problem seems to be specific to v2018.8.
So far I have tried this with 2 separate datasets from different projects and got exactly the same result both times: under 2018.8 the command fails because stratification is applied even with the flag, while under 2018.6 it works.
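
For anyone reading along, the condition the error message complains about (values that match only one sample) can be checked with a few lines of Python. This is only a rough sketch: the function name singleton_values and the example column are made up for illustration, and it is not the plugin's own check.

```python
from collections import Counter


def singleton_values(column_values):
    """Return the values that occur exactly once in a metadata column."""
    counts = Counter(column_values)
    return sorted(value for value, n in counts.items() if n == 1)


# Hypothetical metadata column: a continuous variable, so most values are unique.
ph = [6.8, 7.1, 7.1, 5.9, 6.2]

# Any value listed here would block a stratified train/test split.
print(singleton_values(ph))
```

With a continuous regression target nearly every value is a singleton, which is why stratification has to be switched off for this kind of column.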

Nicholas_Bokulich · September 26, 2018, 7:34pm

Thanks for reporting @sbslm! Good catch, you found a nice bug.

I did a major overhaul of how table splitting (and everything else in regress_samples) happens in 2018.8, and it looks like I accidentally hardcoded stratification so that it always occurs, regardless of the flag.
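
To illustrate the kind of mistake this is, here is a minimal sketch of the pattern. It is not the actual plugin code: prepare_training_data, split_table_buggy and split_table_fixed are hypothetical stand-ins, and the real functions take more arguments.

```python
from collections import Counter


def prepare_training_data(targets, stratify):
    """Hypothetical stand-in for the internal helper that splits the data."""
    if stratify:
        counts = Counter(targets)
        # Stratified splitting needs at least two samples per value.
        if counts and min(counts.values()) < 2:
            raise ValueError(
                "each class (value) must contain at least two samples")
    return "stratified" if stratify else "unstratified"


def split_table_buggy(targets, stratify=True):
    # Bug pattern: True is hardcoded, so the caller's choice is ignored.
    return prepare_training_data(targets, stratify=True)


def split_table_fixed(targets, stratify=True):
    # Fix pattern: forward the caller's choice to the helper.
    return prepare_training_data(targets, stratify=stratify)


# Continuous regression targets are usually all unique.
targets = [4.2, 5.1, 6.3, 7.8]

print(split_table_fixed(targets, stratify=False))

try:
    split_table_buggy(targets, stratify=False)
except ValueError as err:
    print("buggy version still stratifies:", err)
```

The buggy wrapper raises even though the caller asked for no stratification, which matches the behaviour reported above.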

I have fixed the code and this fix will be available in the October release.

If you can't wait that long, you can make a very easy fix to the code:

If this pull request is merged by the time you try this:

Clone the q2-sample-classifier repository.
In your terminal, from the top level of the cloned repository, type: pip install -e .

If that pull request is not merged:

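Clone the q2-sample-classifier repository.
cd q2-sample-classifier (change into the cloned repository in your terminal)
Open q2_sample_classifier/classify.py and find the call inside split_table that passes stratify=True (the traceback above shows it at line 228 in the 2018.8 release).
Change stratify=True to stratify=stratify in a text editor.
Back at the top level of the repository, type in your terminal: pip install -e .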

And you should be ready to roll.
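
If you want to confirm that Python is now picking up your edited clone rather than the copy under site-packages, something like the following should do it. This is a small sketch using only the standard library; package_location is a hypothetical helper name, and it has to be run inside your activated QIIME 2 environment.

```python
import importlib.util


def package_location(name):
    """Return the file that `import name` would load, or None if not installed."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


location = package_location("q2_sample_classifier")
if location is None:
    print("q2_sample_classifier is not importable in this environment")
else:
    print("q2_sample_classifier is loaded from:", location)
```

After an editable install the printed path should point into your cloned repository instead of the conda environment's site-packages directory.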

Thanks for reporting! I hope that helps.

sbslm · September 26, 2018, 7:52pm

Thanks for the quick answer @Nicholas_Bokulich! Awesome work!
Glad to be of help.

Cheers
