Feature-classifier classify-sklearn. Memory error

Dear all,
I have been struggling for a while with a memory error from the classifier. I trained it myself using QIIME 2 2019.1, and it recently worked fine with a smaller dataset.
Here is the command I am using:

!qiime feature-classifier classify-sklearn \
  --i-classifier training-feature-classifiers/classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza

And I am receiving an error:

Plugin error from feature-classifier:
Debug info has been saved to /home/bio/anaconda2/tempfiles/qiime2-q2cli-err-x96sso5f.log

Here is the log:
Traceback (most recent call last):
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/q2cli/commands.py", line 274, in __call__
results = action(**arguments)
File "</home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/decorator.py:decorator-gen-338>", line 2, in classify_sklearn
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/sdk/action.py", line 231, in bound_callable
output_types, provenance)
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/sdk/action.py", line 365, in callable_executor
output_views = self._callable(**view_args)
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_feature_classifier/classifier.py", line 215, in classify_sklearn
confidence=confidence)
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 45, in predict
for chunk in _chunks(reads, chunk_size)) for m in c)
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 917, in __call__
if self.dispatch_one_batch(iterator):
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
result = ImmediateResult(func)
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 549, in __init__
self.results = batch()
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 225, in __call__
for func, args, kwargs in self.items]
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 225, in
for func, args, kwargs in self.items]
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 52, in _predict_chunk
return _predict_chunk_with_conf(pipeline, separator, confidence, chunk)
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 66, in _predict_chunk_with_conf
prob_pos = pipeline.predict_proba(X)
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/utils/metaestimators.py", line 118, in
out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/pipeline.py", line 382, in predict_proba
return self.steps[-1][-1].predict_proba(Xt)
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/naive_bayes.py", line 104, in predict_proba
return np.exp(self.predict_log_proba(X))
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/naive_bayes.py", line 86, in predict_log_proba
log_prob_x = logsumexp(jll, axis=1)
File "/home/bio/anaconda2/envs/qiime2-2019.1/lib/python3.6/site-packages/scipy/special/_logsumexp.py", line 112, in logsumexp
tmp = np.exp(a - a_max)
MemoryError

I suspected that I had too little space on my system partition, so I exported TMPDIR to another partition with plenty of space, but I am still getting the same error. I have about 128 GB of RAM and 600 GB of free disk space, with no heavy processes running in parallel. My current dataset has about 410 samples. I was using JupyterLab instead of the usual terminal; everything worked with a smaller dataset.
Now I am trying to repeat this command from a terminal, but I don't think that is the issue.
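
For reference, this is roughly how I redirected the temporary directory before re-running (the /mnt/data/tmp path is just an example here, not my real mount point):

# point QIIME 2's temporary files at a partition with plenty of free space
export TMPDIR=/mnt/data/tmp
mkdir -p "$TMPDIR"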
Please help me figure out what I need to do to solve this.
Many thanks.

Update: Got the same error from the terminal.


Not quite: a MemoryError refers to RAM, which is different from your hard drive's space.


Hi, thank you for your reply, but I have 128 GB of RAM. I forgot to include the --p-n-jobs parameter, and I was going to try repeating the run with more threads, but right now I can't access my work computer (I am on vacation and having some problems with TeamViewer).

It sounds like this might not be enough for you.

Adding multiple jobs here comes with a memory penalty: each additional job needs more memory.
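
If you do experiment with parallelism, keeping the job count explicit makes that trade-off visible. A minimal sketch using your file names (the value of 1 is just a conservative starting point, not a tuned recommendation):

qiime feature-classifier classify-sklearn \
  --i-classifier training-feature-classifiers/classifier.qza \
  --i-reads rep-seqs.qza \
  --p-n-jobs 1 \
  --o-classification taxonomy.qza

Each extra job holds more reads in RAM at once, so raise it only if you have memory headroom.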


So should I divide my samples and then combine the output files, or would it be better to use other options for classifying? Could you advise me on the optimal solution in such a case?

No, that approach is not recommended here.

You could use a different database (for example, one that is smaller), or you could try setting the chunk-size parameter to a lower value.


I forgot to thank you for your solution.

The parameter was renamed, and adding this line

--p-reads-per-batch 10000 \

helped me. I used only one thread this time; one can experiment with the value to reduce the time required.
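
For anyone else hitting this, the full command that worked for me looked roughly like the following (same artifacts as above; 10000 reads per batch is simply the value that fit in my 128 GB of RAM, so adjust it to your machine):

qiime feature-classifier classify-sklearn \
  --i-classifier training-feature-classifiers/classifier.qza \
  --i-reads rep-seqs.qza \
  --p-reads-per-batch 10000 \
  --p-n-jobs 1 \
  --o-classification taxonomy.qza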

Thank you for helping me out.
