I am trying to classify sequences using the silva reference (515F/806R) on the latest release of QIIME. However, since I am running this on a shared compute cluster, my job keeps getting killed because it exceeds its memory allocation.
When I submit this job, I usually allocate a full node with 32 cores and 256 GB of memory. After about an hour, however, memory usage keeps growing until it goes above the initial allocation and the scheduler kills the job.
Side question: what are the memory requirements of the classifier a function of? My impression is that the reference would be the only factor, but maybe I'm wrong.
I think this post basically answers your question. In short, using --p-n-jobs 8 with silva is probably going to take a lot of memory, and we don't have a good way to know what the right number for n_jobs would be either.
When I monitored the job, one of the workers was using ~130 GB of memory, while the other workers were using roughly 10 GB of memory. I've resubmitted the job with 1 job; it's been running for about 4 hours and is using ~80 GB of memory. Is this to be expected from the Silva database? Other posts mention increased memory requirements for this database, but give no specifics on what to expect in terms of memory usage.
EDIT:
The job has been running for 20 hours and is using ~80 GB of memory. Note that the same input file runs in ~15 minutes using the gg database.
That doesn't seem right to me, @Nicholas_Bokulich, @BenKaehler, is there another parameter that should be set here? Are there certain kinds of inputs that can trigger a very bad worst-case behavior in the classifier?
@yoshiki @ebolyen I believe the parameter you are looking for is chunk-size. @BenKaehler will need to offer more insight on proper settings for this parameter, but you could start experimenting with it while waiting for a response from Ben.
Not yet. The process has been running for 43 hours now, with no results so far. Note that I have ~20,000 sequences to classify, so it's not that much. I don't think this is actually working, but it's worth waiting.
We get a similar problem on our Linux compute cluster: greengenes works but silva does not, and the error message says "OSError: [Errno 28] No space left on device". The ~9.6 GB silva database is being unpacked into the ~9 GB /tmp directory on the compute nodes, cannot complete, and then the job stops. Is there some way to direct the command to unpack the database somewhere else? I cannot find an environment variable that I can override. I may have to ask the admin to increase the size of /tmp, if possible. The compute nodes, BTW, have ~500 GB of RAM, so that is not the issue.
But maybe this is not the problem that yoshiki is having, as his job runs for hours before quitting. His database must be open, but perhaps the output is also first written to a small tmp space.
To address that problem I think you might be able to set the TMPDIR environment variable to a directory where you have enough space so you wouldn't get that OSError.
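Here is a minimal sketch of why that can work, assuming the tool creates its scratch files through Python's standard tempfile module (that is an assumption about the internals, not something confirmed in this thread). The helper name resolve_tmpdir and the scratch directory are hypothetical stand-ins.

```python
import os
import tempfile


def resolve_tmpdir(path):
    """Point TMPDIR at `path` and return the directory tempfile will use."""
    os.environ["TMPDIR"] = path
    # tempfile caches its choice on first use; clear the cache so the
    # TMPDIR environment variable is consulted again.
    tempfile.tempdir = None
    return tempfile.gettempdir()


# Hypothetical stand-in for a large scratch directory on the worker node.
scratch = tempfile.mkdtemp()
print(resolve_tmpdir(scratch))
```

In a real job you would more likely export TMPDIR in the submission script before the qiime command runs, so the process starts with the variable already set.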
Yes, I did just that and the "moving picture" dataset ran inside of an hour, with either 1 or 64 cores. Not sure why I do not see much of a gain with more cores. Thanks, Bill
Others have had issues with the silva classifier. My last response on the issue is here.
In summary, when I run either of the pre-trained silva classifiers on my laptop they use up a maximum of around 11GB of memory with a minimal number of rep seqs. One way to bring the memory usage down is to reduce the chunk size using the --p-chunk-size parameter. Starting with a chunk size of 1,000 is one way to start debugging this problem. (And yes, changing --p-n-jobs from its default of 1 will cause multiple copies of the classifier artifact to be loaded into memory.)
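As a rough back-of-the-envelope sketch of the n_jobs point: if every worker loads its own copy of the classifier, the node's memory puts a ceiling on how many workers can run. The function name max_parallel_jobs is hypothetical, and the per-worker figure is an assumption you would need to measure yourself, since it also grows with the chunk each worker is processing.

```python
def max_parallel_jobs(node_memory_gb, per_worker_gb):
    """Upper bound on workers if each one needs `per_worker_gb` of memory."""
    if per_worker_gb <= 0:
        raise ValueError("per_worker_gb must be positive")
    return int(node_memory_gb // per_worker_gb)


# 256 GB node, ~11 GB per worker (the laptop figure with very few reads).
print(max_parallel_jobs(256, 11))   # 23
# Same node, but a worker that has grown to ~130 GB as reported earlier.
print(max_parallel_jobs(256, 130))  # 1
```

The first number is optimistic: it only accounts for loading the classifier, not for the reads being classified, which is why lowering the chunk size matters as well.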
@ebolyen, is it possible that cluster users are running into problems when large artifacts get unzipped into temp directories that are mounted over a network? Setting TMPDIR, or whatever the variable is in the user's cluster environment, to scratch space that is local to the worker nodes may help.
Just by way of explanation, the silva database has always been troublesome, partly because of the large number of unique taxonomies (82,325 in the taxonomy file I'm looking at).
@yoshiki, are you still waiting for your 20,000 sequences to finish? If you are able to share them with me I'm happy to debug.
Thanks for the pointers @BenKaehler! Would you mind explaining what "chunk size" is, and how it relates to the number of jobs running and the memory requirements of each job? I don't have a good mental model of this, so it's tough for me to fiddle with these parameters.
To bring closure to the posts above:
I am no longer waiting on that job. I also made a mistake in my post above: I should have written 100K sequences instead of 20K sequences. The job with 100K sequences finished in about 3-4 days. I later processed a different dataset with 20K sequences, first with 1 job and then with 8 jobs. The classification using 1 job took 9 hours of walltime and ~22GB of memory; the classification using 8 jobs took 7.5 hours and ~160GB of memory. While monitoring the parallel classification (with 8 jobs) I noticed that 7 of the processes had memory allocated to them but were not using the CPU at all. Do you have any insight as to why this is happening?
I would guess that the reason that all the reads went to the first worker is that chunk_size is set to be large enough to effectively turn off chunking. To get some sort of reasonable parallelism it would probably be better to set chunk_size to be, maybe, 1,000.
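To make that concrete, here is a small sketch of the arithmetic, assuming reads are simply cut into consecutive batches of chunk_size and each batch goes to one worker (a simplified model, not the actual implementation). The helper names are hypothetical; 262144 is the default chunk size shown in the help text for classify-sklearn.

```python
def count_chunks(n_reads, chunk_size):
    """Number of batches needed to cover n_reads (ceiling division)."""
    return -(-n_reads // chunk_size)


def busy_workers(n_reads, chunk_size, n_jobs):
    """Workers that can receive a batch; the rest sit idle."""
    return min(n_jobs, count_chunks(n_reads, chunk_size))


# 20,000 reads with the default chunk size: one batch, so one busy worker.
print(count_chunks(20_000, 262_144), busy_workers(20_000, 262_144, 8))  # 1 1
# Same reads with a chunk size of 1,000: twenty batches, all 8 workers busy.
print(count_chunks(20_000, 1_000), busy_workers(20_000, 1_000, 8))      # 20 8
```

Under this model, 8 jobs with the default chunk size would leave 7 workers holding a copy of the classifier with nothing to do, which matches what was observed.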
I've added some parameter documentation to classify-sklearn. This is what it looks like now:
$ qiime feature-classifier classify-sklearn --help
Usage: qiime feature-classifier classify-sklearn [OPTIONS]

  Classify reads by taxon using a fitted classifier.

Options:
  --i-reads PATH                  Artifact: FeatureData[Sequence]  [required]
                                  The feature data to be classified.
  --i-classifier PATH             Artifact: TaxonomicClassifier  [required]
                                  The taxonomic classifier for classifying the
                                  reads.
  --p-chunk-size INTEGER          [default: 262144]
                                  Number of reads to process
                                  in each batch.
  --p-n-jobs INTEGER              [default: 1]
                                  The maximum number of
                                  concurrently worker processes. If -1 all
                                  CPUs are used. If 1 is given, no parallel
                                  computing code is used at all, which is
                                  useful for debugging. For n_jobs below -1,
                                  (n_cpus + 1 + n_jobs) are used. Thus for
                                  n_jobs = -2, all CPUs but one are used.
  --p-pre-dispatch TEXT           [default: 2*n_jobs]
                                  "all" or expression, as
                                  in "3*n_jobs". The number of batches (of
                                  tasks) to be pre-dispatched.
  --p-confidence FLOAT            [default: 0.7]
                                  Confidence threshold for
                                  limiting taxonomic depth. Provide -1 to
                                  disable confidence calculation, or 0 to
                                  calculate confidence but not apply it to
                                  limit the taxonomic depth of the
                                  assignments.
  --p-read-orientation [same|reverse-complement]
                                  [optional]
                                  Direction of reads with respect
                                  to reference sequences. same will cause
                                  reads to be classified unchanged; reverse-
                                  complement will cause reads to be reversed
                                  and complemented prior to classification.
                                  Default is to autodetect based on the
                                  confidence estimates for the first 100
                                  reads.
  --o-classification PATH         Artifact: FeatureData[Taxonomy]  [required
                                  if not passing --output-dir]
  --output-dir DIRECTORY          Output unspecified results to a directory
  --cmd-config PATH               Use config file for command options
  --verbose                       Display verbose output to stdout and/or
                                  stderr during execution of this action.
                                  [default: False]
  --quiet                         Silence output if execution is successful
                                  (silence is golden).  [default: False]
  --help                          Show this message and exit.
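The n_jobs rule in that help text is easy to misread, so here is a small sketch of it as I understand the description. The function name effective_workers is hypothetical, and rejecting 0 and clamping to at least one worker are my assumptions, not something the help text states.

```python
def effective_workers(n_jobs, n_cpus):
    """Translate an n_jobs setting into a worker count for n_cpus CPUs."""
    if n_jobs == 0:
        # Assumption: 0 has no meaning under the documented rule.
        raise ValueError("n_jobs must not be 0")
    if n_jobs < 0:
        # Documented rule: (n_cpus + 1 + n_jobs) workers are used.
        # Clamping to at least 1 is an assumption for very negative values.
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


for setting in (1, 8, -1, -2):
    print(setting, effective_workers(setting, n_cpus=32))
# 1 1
# 8 8
# -1 32
# -2 31
```

On the 32-core node mentioned at the top of the thread, -1 would therefore mean 32 workers, each with its own copy of the classifier, so a negative value is probably not what you want with silva.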