The good news is I finally got the classifier to complete, but the bad news is I’m not sure I fully understand why it was failing in the first place (suspicions outlined below).
It looks like many of the issues posted about classifier memory come from users testing the pre-trained SILVA database on a laptop with 8–16 GB of RAM, followed by replies indicating that they need more memory and suggesting a compute cluster or AWS. My process was segfaulting on our cluster using 120, 500, or even 800 GB of memory, so while I’m still getting an error that quite distinctly seems memory-related, I don’t understand how it could be crashing on a high-memory node.
```
slurmstepd: error: Job 36840 exceeded memory limit (408895836160 > 134209339392), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 36840 ON node108 CANCELLED AT 2018-12-08T00:48:02
```
One thing that would be valuable to understand is how the inputs provided to the script ultimately get read into memory for the process:
- My reference classifier is just over 500 MB; I’m guessing this is read into memory in its entirety before even one sequence is queried? I’m curious how much decompression inflates that value. The entire classifier must get read in at once, right?
- My representative-sequence dataset contains about 10,000 sequence variants to be classified. It seems unlikely that this alone would cause a significant memory issue, but maybe it should be tuned with smaller batches? Is this the only place where batching reads can reduce the memory footprint (at the cost of a longer run with fewer reads per batch)? See the sketch just after this list for what I mean.
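If I’m reading the docs correctly, the knob for this is `--p-reads-per-batch` on `classify-sklearn`; something like the sketch below is what I have in mind (file names are placeholders for my inputs, and the batch size is an arbitrary example):

```
# Sketch only -- file names are placeholders for my ~500 MB pre-trained
# classifier and my ~10,000 representative sequences. Smaller values of
# --p-reads-per-batch should shrink the per-batch memory footprint at the
# cost of a longer total run time.
qiime feature-classifier classify-sklearn \
  --i-classifier silva-classifier.qza \
  --i-reads rep-seqs.qza \
  --p-reads-per-batch 2000 \
  --o-classification taxonomy.qza
```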
My suspicion as to why it finally completed without the memory error: I think I’m misinterpreting the documentation describing how I should be allocating CPUs. With the SLURM job manager on our cluster we define the number of CPUs to use per task and the number of tasks (which I’ve been interpreting as “threads” and “CPUs”, respectively). I wonder whether the `--p-n-jobs` parameter is supposed to refer to the number of CPU nodes, or to the number of threads per CPU. When I think about multithreading (as with vsearch), I usually think of a parameter like `--p-n-jobs` as referring to the number of “threads”, but maybe that’s where I’m mistaken. Is it referring to the number of CPUs instead? If so, I’m wondering what to specify for the number of tasks per CPU, i.e. the threads; there doesn’t seem to be a parameter in this script for that.
The program would always crash, no matter how much memory I gave it (up to 800 GB), if I specified 24 CPUs in my SLURM script and set the `--p-n-jobs` parameter to `-1`. I did that thinking it was how to use all the threads, but perhaps my mistake is that it’s trying to use all the CPUs on the cluster, which my SLURM script is specifying not to do…
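For reference, my failing submissions were set up roughly like this (just a sketch; file names are placeholders, and the SBATCH comments show how I’ve been mapping the terminology):

```
#!/bin/bash
#SBATCH --ntasks=1            # "tasks" -- what I've been reading as "CPUs"
#SBATCH --cpus-per-task=24    # "CPUs per task" -- what I've been reading as "threads"
#SBATCH --mem=800G            # also tried 120 and 500 GB; every run was killed

# --p-n-jobs -1 was my attempt at "use all the threads"
qiime feature-classifier classify-sklearn \
  --i-classifier silva-classifier.qza \
  --i-reads rep-seqs.qza \
  --p-n-jobs -1 \
  --o-classification taxonomy.qza
```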
Once I switched that parameter back to `1`, the program finished the job on the regular 128 GB RAM node after about 8 hours.
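The run that completed was the same script with only two changes (again, just a sketch of my setup):

```
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --mem=128G            # the regular node, no high-mem request

# a single job instead of -1 finished in roughly 8 hours
qiime feature-classifier classify-sklearn \
  --i-classifier silva-classifier.qza \
  --i-reads rep-seqs.qza \
  --p-n-jobs 1 \
  --o-classification taxonomy.qza
```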
Thanks for any input you can provide about how the `--p-n-jobs` parameter is supposed to be interpreted with regard to threads vs. CPUs.