Classifier training parameters

The good news is I finally got the classifier to complete, but the bad news is I’m not sure I fully understand why it was failing in the first place (suspicions outlined below).

It looks like many of the issues posted about classifier memory are from users testing the pre-trained SILVA database on a laptop with 8-16 GB of RAM, followed by replies indicating that they need more memory and suggesting they try a compute cluster or go the AWS route. My process was seg-faulting on our cluster with either 120, 500, or 800 GB of memory, so while I’m still getting an error that quite distinctly seems memory related, I don’t understand how it could be crashing on a high-memory node.

slurmstepd: error: Job 36840 exceeded memory limit (408895836160 > 134209339392), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 36840 ON node108 CANCELLED AT 2018-12-08T00:48:02

One thing that would be valuable to understand is how the information provided to the script ultimately gets read into memory for the process:

  1. My reference classifier is just over 500 MB; I’m guessing this is read into memory entirely before even one sequence is queried? I’m curious how much decompression inflates that value. The entire classifier must get read in at once, right?
  2. My representative sequence dataset contains about 10,000 sequence variants to be classified. It seems unlikely that this would be causing a significant memory issue, but maybe it should be tuned with smaller batches? Is this the only part where batching reads can reduce the memory footprint (at the cost of taking longer with fewer reads per batch)?

My suspicion as to why it finally completed without the memory error… I think I’m misinterpreting the documentation describing how I should be allocating CPUs. With the SLURM job manager on our cluster we define the number of CPUs to use per task and the number of tasks (which I’m interpreting as “threads” and “CPUs”, respectively). I wonder whether the --p-n-jobs parameter refers to the number of CPUs or to the number of threads per CPU. When I think about multithreading (like with vsearch) I usually think of that --p-n-jobs parameter as referring to the number of “threads”, but maybe that’s where I’m mistaken. Is it referring to the number of CPUs instead? If so, I’m wondering what to specify for the number of tasks per CPU (the threads thing); it doesn’t seem like there is a parameter in this script for that.

The program would always crash, no matter how much memory I gave it (up to 800 GB), if I specified 24 CPUs in my SLURM script and set the --p-n-jobs parameter to -1. I did that thinking it was how to use all of the threads, but perhaps my mistake is that it tries to use all of the CPUs on the cluster, which my SLURM script specifies not to do…

Once I switched that parameter back to 1, the program finished the job on the regular 128 GB RAM node after about 8 hours.

Thanks for any input you can provide about how that --p-n-jobs parameter is supposed to be interpreted with regard to threads vs. CPUs.

Yes, the entire classifier is read into memory before even one sequence is queried. So that explains why you had immediate memory issues before… you were reading N classifiers into memory (where N = the number of jobs you are running). 500 MB would decompress to a much larger size (not sure how large), so all in all it is not unimaginable that 24 jobs would eat up 1 TB of RAM in no time.
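Just to make the scaling concrete, here is a rough back-of-the-envelope sketch. The inflation factor is a made-up placeholder (I don’t know the real number), but it shows how per-process copies multiply:

```python
# Hypothetical numbers, purely illustrative: each worker process loads its
# own copy of the decompressed classifier, so memory grows roughly linearly
# with the number of jobs.
compressed_gb = 0.5      # size of the classifier artifact on disk
inflation_factor = 30    # unknown in reality; placeholder for this sketch
n_jobs = 24              # one classifier copy per worker process

estimated_ram_gb = compressed_gb * inflation_factor * n_jobs
print(f"~{estimated_ram_gb:.0f} GB of RAM")  # ~360 GB with these made-up numbers
```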

Yep, a large number of queries will impact memory usage, and this is where batch size matters (and more batches do increase run time a little bit).
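The general idea is just the pattern sketched below (this is not QIIME’s actual implementation, and classify_all / iter_batches are made-up names for illustration): classify the queries in fixed-size chunks so that only one chunk is expanded in memory at a time.

```python
# Generic sketch of read batching, not QIIME's actual code. Smaller batches
# lower peak memory; more batches add a little per-batch overhead.
def iter_batches(seqs, batch_size):
    """Yield successive slices of `seqs` with at most `batch_size` items each."""
    for start in range(0, len(seqs), batch_size):
        yield seqs[start:start + batch_size]

def classify_all(classifier, seqs, batch_size=1000):
    predictions = []
    for batch in iter_batches(seqs, batch_size):
        predictions.extend(classifier.predict(batch))  # only this batch in memory
    return predictions
```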

You should read up on what scikit-learn does in this regard — sorry, I don’t know the actual answer so you should get it straight from the horse’s mouth! (and then let me know — I’ve made the same assumptions you have so would like to know if I’m wrong) :horse:

Hi @Nicholas_Bokulich,
Here is the horse’s documentation. Looking through the QIIME script implementing scikit-learn, it looks like the n_jobs parameter is implemented beginning at line 40, where you define the predict function.
In that code, you incorporate the n_jobs parameter but do not define the backend parameter, which modifies how n_jobs is interpreted. Looking at the very beginning of the joblib.Parallel documentation makes it clear that n_jobs can be interpreted either as the number of CPUs to be used or as the number of threads, depending on how you specify that backend parameter:

Parameters:
n_jobs: int, default: None
    The maximum number of concurrently running jobs, such as the number of Python worker processes when backend="multiprocessing" or the size of the thread-pool when backend="threading".

So I think, if I’m understanding anything about this properly, there is more to it than QIIME’s current implementation. If you want to leave it as-is, it looks like it defaults to CPU (process) usage, but if you want to be able to use this script and specify threads, then more needs to be added to the code. I think, at a minimum, it requires adding the prefer argument to joblib.Parallel, but again, I’m not certain. Whoever in QIIME made this plugin possible will probably understand exactly what modification is necessary, and my guess is it won’t be a huge addition.
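To illustrate what I mean, here is a minimal joblib sketch (not QIIME’s code; the sqrt call is just a stand-in for real work) showing how the same n_jobs value means processes or threads depending on that setting:

```python
# Minimal joblib sketch (stand-in work function, not QIIME's code).
from math import sqrt
from joblib import Parallel, delayed

# Default process-based backend: n_jobs separate worker processes, each
# holding its own copy of any large objects it needs.
as_processes = Parallel(n_jobs=4)(delayed(sqrt)(i) for i in range(100))

# prefer="threads" keeps the work in one process with a pool of n_jobs
# threads that share memory instead of duplicating it.
as_threads = Parallel(n_jobs=4, prefer="threads")(
    delayed(sqrt)(i) for i in range(100)
)
```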

Hope this sheds some light on the situation rather than muddying the waters, but who knows.

Thanks!

Thanks @devonorourke

You’re looking at him. Yep, sounds like a simple modification. I will need to mull this a bit more before deciding what is appropriate.