Using the silva pre-trained classifier consumes more than 256 GB of memory

yoshiki · September 11, 2017, 4:09pm

I am trying to classify sequences using the silva reference (515F/806R) on the latest release of QIIME, however since I am running this in a shared compute cluster my job keeps getting killed as it exceeds memory requirements.

This is the command I am running:

qiime feature-classifier classify-sklearn --i-classifier bleep/silva-119-99-515-806-nb-classifier.qza --i-reads bloop/representative-sequences.qza --o-classification bloop/trimmed-200nts-qiita/taxonomy.silva.qza --p-n-jobs 8

When I submit this job, I usually allocate a full node with 32 cores and 256 GB of memory, however after about an hour the memory usage keeps growing, until it goes above the initial allocation and makes the scheduler kill the job.

Side question: what are the memory requirements of the classifier a function of? My impression is that the reference would be the only factor, but maybe I'm wrong.

ebolyen · September 11, 2017, 8:17pm

Hi @yoshiki,

I think this post basically answers your question. But in short using --p-n-jobs 8 on silva is probably going to take a lot of memory and we don't have a good way to know what the right number for n_jobs would be either.

yoshiki · September 12, 2017, 12:38am

When I monitored the job, one of the workers was using around ~130 GB of memory (), the other workers were using roughly 10GB of memory. I've resubmitted the job with 1 job, it's been running for about 4 hours, and it's using ~80GB of memory (). Is this to be expected from the Silva database? Other posts mention increased memory requirements for this database, but not any specifics of what to expect in terms of memory usage.

EDIT:

The job has been running for 20 hours ~80 GB of memory. Note that the input file runs in ~15 minutes using the gg database.

ebolyen · September 12, 2017, 8:03pm

That doesn't seem right to me, @Nicholas_Bokulich, @BenKaehler, is there another parameter that should be set here? Are there certain kinds of inputs that can trigger a very bad worst-case behavior in the classifier?

Nicholas_Bokulich · September 12, 2017, 8:18pm

@yoshiki @ebolyen I believe the parameter you are looking for is chunk-size. @BenKaehler will need to offer more insight on proper settings for this parameter, but you could start experimenting with that parameter while waiting for a response from Ben.

yoshiki · September 12, 2017, 8:58pm

Yeah, it's still running, 100% CPU usage ~80 GB , currently running for a total of 24 hours.

yoshiki · September 13, 2017, 4:12am

One thing I did not mention before is that I am now rerunning this with the full-length trained classifier.

sbusi · September 13, 2017, 4:54pm

@yoshiki Any luck with the SILVA full-length classifier yet? I'm having the same issues with the 515F/806R reference (SILVA) database

yoshiki · September 13, 2017, 4:57pm

Not yet, the process has been running for 43 hours now, no results yet. Note that I have ~20,000 sequences to classify, so it's not that much. I don't think this is actually working, but it's worth waiting.

sbusi · September 13, 2017, 5:06pm

Hmmm, it helps to know that information. Have you tried the --p-classify-chunk-size paramter? I am getting an error stating, 'no such option'.

Thanks for the initial post too!

spollenw · September 14, 2017, 2:06pm

We get a similar problem on our linux compute cluster: greengenes works but silva does not, and the error message says "OSError: [Errno 28] No space left on device". The ~9.6 GB silva database is being opened into the ~9GB /tmp directory on the compute nodes, but cannot complete and then the job stops. Is there some way to direct the command to use a different place to open the database? I cannot find an environment var that I can overwrite. I may have to ask the admin to increase the /tmp dir size, if possible. The compute nodes BTW have ~500 GB RAM, so that is not the issue.
But maybe this is not the problem that yoshiki is having as his runs for hours before quitting. His database must be open, but perhaps the output is also first written to a small tmp space.

Bill

yoshiki · September 14, 2017, 2:13pm

To address that problem I think you might be able to set the TMPDIR environment variable to a directory where you have enough space so you wouldn't get that OSError.

spollenw · September 14, 2017, 4:40pm

Yes, I did just that and the "moving picture" dataset ran inside of an hour, with either 1 or 64 cores. Not sure why I do not see much of a gain with more cores. Thanks, Bill

BenKaehler · September 15, 2017, 4:37am

Hi Everyone, sorry for the slow response.

Others have had issues with the silva classifier. My last response on the issue is here.

In summary, when I run either of the pre-trained silva classifiers on my laptop they use up a maximum of around 11GB of memory with a minimal number of rep seqs. One way to bring the memory usage down is to reduce the chunk size using the --p-chunk-size parameter. Starting with a chunk size of 1,000 is one way to start debugging this problem. (And yes, changing --p-n-jobs from its default of 1 will cause multiple copies of the classifier artifact to be loaded into memory.)

@ebolyen, is it possible that cluster users are running into problems if large artifacts are getting unzipped into temp directories that are mounted over a network? Setting TMPDIR, or whatever the variable is on the user's cluster environment, to scratch that is local to the worker nodes may help.

Just by way of explanation, the silva database has always been troublesome, partly because of the large number of unique taxonomies (82,325 in the taxonomy file I'm looking at).

@yoshiki, are you still waiting for your 20,000 sequences to finish? If you are able to share them with me I'm happy to debug.

yoshiki · September 15, 2017, 3:28pm

Thanks for the pointers @BenKaehler! Would you mind explaining what "chunk size" is, and how it relates to the number of jobs running and the memory requirements of each job? I don't have a good mental model of this, so it's tough for me to fiddle with these parameters.

To bring closure to the posts above:

I am no longer waiting on that job, also I made a mistake in my post above, I should have written 100K sequences instead of 20K sequences. The job with 100K sequences finished in about 3-4 days. I later processed a different dataset with 20K sequences first with 1 job and then with 8 jobs. The classification using 1 job took 9 hours walltime and ~22GB of memory, the classification using 8 jobs took 7.5 hours and ~160GB of memory. While monitoring the parallel classification (with 8 jobs) I noticed that there were 7 processes that had memory allocated to them, but were not using the CPU at all. Do you have any insight as to why this is happening?

BenKaehler · September 15, 2017, 8:12pm

Thanks for the closure, @yoshiki.

I would guess that the reason that all the reads went to the first worker is that chunk_size is set to be large enough to effectively turn off chunking. To get some sort of reasonable parallelism it would probably be better to set chunk_size to be, maybe, 1,000.

I've added some parameter documentation to classify-sklearn. This is what it looks like now:

$ qiime feature-classifier classify-sklearn --help
Usage: qiime feature-classifier classify-sklearn [OPTIONS]

  Classify reads by taxon using a fitted classifier.

Options:
  --i-reads PATH                  Artifact: FeatureData[Sequence]  [required]
                                  The feature data to be classified.
  --i-classifier PATH             Artifact: TaxonomicClassifier  [required]
                                  The taxonomic classifier for classifying the
                                  reads.
  --p-chunk-size INTEGER          [default: 262144]
                                  Number of reads to process
                                  in each batch.
  --p-n-jobs INTEGER              [default: 1]
                                  The maximum number of
                                  concurrently worker processes. If -1 all
                                  CPUs are used. If 1 is given, no parallel
                                  computing code is used at all, which is
                                  useful for debugging. For n_jobs below -1,
                                  (n_cpus + 1 + n_jobs) are used. Thus for
                                  n_jobs = -2, all CPUs but one are used.
  --p-pre-dispatch TEXT           [default: 2*n_jobs]
                                  "all" or expression, as
                                  in "3*n_jobs". The number of batches (of
                                  tasks) to be pre-dispatched.
  --p-confidence FLOAT            [default: 0.7]
                                  Confidence threshold for
                                  limiting taxonomic depth. Provide -1 to
                                  disable confidence calculation, or 0 to
                                  calculate confidence but not apply it to
                                  limit the taxonomic depth of the
                                  assignments.
  --p-read-orientation [same|reverse-complement]
                                  [optional]
                                  Direction of reads with respect
                                  to reference sequences. same will cause
                                  reads to be classified unchanged; reverse-
                                  complement will cause reads to be reversed
                                  and complemented prior to classification.
                                  Default is to autodetect based on the
                                  confidence estimates for the first 100
                                  reads.
  --o-classification PATH         Artifact: FeatureData[Taxonomy]  [required
                                  if not passing --output-dir]
  --output-dir DIRECTORY          Output unspecified results to a directory
  --cmd-config PATH               Use config file for command options
  --verbose                       Display verbose output to stdout and/or
                                  stderr during execution of this action.
                                  [default: False]
  --quiet                         Silence output if execution is successful
                                  (silence is golden).  [default: False]
  --help                          Show this message and exit.

yoshiki · September 18, 2017, 10:15pm

Thanks for your replies @BenKaehler!

jairideout · October 11, 2017, 4:14pm

An off-topic reply has been split into a new topic: MemoryError when training classifier with SILVA

Please keep replies on-topic in the future.

yoshiki · November 11, 2017, 10:14pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.