Error running feature-classifier classify-sklearn (memory related)

I’m using q2cli version 2019.7.0

When I run:

qiime feature-classifier classify-sklearn --i-classifier silva-132-99-515-806-nb-classifier.qza \
--i-reads rep-seqs_230-200_trunc.qza  ...

I’m getting the following error:

...

joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGBUS(-7)}

Plugin error from feature-classifier:

  A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGBUS(-7)}

See above for debug info.

I’m on a shared cluster, in an interactive session where I requested 24 cores and 60 GB of RAM. I first ran this with --p-n-jobs 12 and got the error above, then tried --p-n-jobs 6 and got the same error again. When I checked the processes with top, I saw that each job (6 in total in this case) was consuming about 20 GB of virtual memory (VIRT) and about 8–9 GB resident (RES). Of course I’ll try it with only one job, but it was surprising to me that every job can consume that amount of memory. It seems that on Linux, working with SILVA, it is hard to use more than a couple of cores if every core/job consumes that much memory.
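For reference, this is roughly how I was watching per-job memory on the node (just a sketch; the exact top/ps columns and sort options may differ between systems):

# list my processes sorted by resident memory; VSZ/RSS are reported in KiB
ps -u $USER -o pid,comm,vsz,rss --sort=-rss | head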
Have you experienced such memory consumption?

Edit: When I ran it with fewer cores and the classification finished without errors, I could see that the almost 20 GB per job was a peak that occurs when the multiple jobs start up (which isn’t the very beginning of the run). After that, each job stabilized at around 13–14 GB and stayed there to the end, so I always got the error at the start of the multi-core part of the run.

Welcome to the forum @matrs, and thanks for asking about this and for your troubleshooting:

Yes, this behavior is (unfortunately) expected — the reason is that each core is storing a separate copy of the classifier in memory, hence the rapid memory consumption at the start of each job. The solution is as you have discovered: either reduce the number of jobs or increase the memory available to each job (if possible).
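For example, something like the following (a sketch only, reusing the filenames from your post; taxonomy.qza is a placeholder output name, and --p-reads-per-batch is an optional knob you can lower to reduce per-batch memory if your release exposes it):

qiime feature-classifier classify-sklearn --i-classifier silva-132-99-515-806-nb-classifier.qza \
--i-reads rep-seqs_230-200_trunc.qza \
--o-classification taxonomy.qza \
--p-n-jobs 1 \
--p-reads-per-batch 10000

With --p-n-jobs 1 only a single copy of the SILVA classifier is loaded, so peak memory should stay close to the 13–14 GB you observed for one job rather than multiplying with the number of workers.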

The SILVA database is also quite large and consumes a great deal of memory. Using Greengenes or another smaller database will require around 4X less memory.