Does a higher number of sockets limit dada2 denoise speed?

Dear Tech Supporters
I am running qiime2-2019.1 on a virtual Linux machine via Docker. My virtual machine has 60 GB of RAM and 12 CPUs (6 sockets with 2 cores per socket).
I have 195 samples from 4 MiSeq runs (my imported .qza is 15 GB).
So far everything ran very smoothly (data import and training feature-classifier). Then I started dada2 denoise-paired using the following command:

sudo docker run -t -i -v $(pwd):/data qiime2/core:2019.1 qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux_SWL_best.qza \
  --p-trim-left-f 5 \
  --p-trim-left-r 5 \
  --p-trunc-len-f 295 \
  --p-trunc-len-r 274 \
  --p-n-threads 0 \
  --o-representative-sequences rep-seqs-dada2_SWL_best.qza \
  --o-table table-dada2_SWL_best.qza \
  --o-denoising-stats stats-dada2_SWL_best.qza

It went through filtering, learning error rates, and denoising, and then started chimera removal. After more than 6 days I stopped the command, because a colleague who runs qiime2 via Docker on a similar Linux VM (100 GB RAM, 16 CPUs) told me that for him it takes only ~10 h to process 6-7 MiSeq runs (~45 GB imported .qza). One of the few clear system differences is that he has 2 sockets with 8 cores per socket. Do you think this is the important difference? I can’t believe that the 4 extra cores he has make all the difference between 10 h and almost a week (and RAM doesn’t seem to be limiting, judging by the system’s performance).
Thank you so much!

Hi @gkma,

Welcome to the forum! I share your skepticism that 16 vs 12 CPU cores explains the difference you are seeing.

Most likely, the difference comes down to the structure of your data, which, for whatever reason, is more complex than your colleague’s. As an example, I was once trying to reproduce an issue someone was having with a 12 GB run, and it took about a week to finish (unfortunately, I was not able to reproduce their issue).

This was before some of the speedups made it into the conda package of DADA2, so I would expect it to take less time if I were to repeat it, but hopefully you can see that one week for 15 GB isn’t too outrageous.

Something else occurs to me: you mentioned there were four runs in total? DADA2 trains the error model it uses for denoising on the run itself. So if you are denoising all four runs at once in the same step, that may be why it is taking as long as it is: it cannot converge on a consistent error model, because there isn’t one.

My advice would be, if you are denoising each run separately and the total time is over 6 days, don’t worry about it and keep going, it shouldn’t take much longer than that!

If you are not denoising separately, then I would recommend you do so, and then merge the tables and representative sequences at the end of all four steps*.

* Since your data appears to be paired end, you just need to make sure that your trim-lefts are the same between all runs, so that you end up with the same amplicons in the end (since you are bounded by your primers).
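A minimal sketch of the per-run workflow described above, assuming the data has been imported as one demux artifact per run. The run names and file names (`run1` ... `run4`, `demux_run1.qza`, and so on) are hypothetical placeholders; the trim and truncation values are simply copied from the original command, and the truncation lengths may need adjusting per run. With the Docker setup from the question, each `qiime` call would be prefixed with the same `sudo docker run ...` invocation. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Hypothetical run names; replace with your own per-run demux artifacts.
RUNS="run1 run2 run3 run4"

# Print commands by default; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

table_args=""
seq_args=""
for r in $RUNS; do
    # Keep the trim-left values identical across runs so that all runs
    # yield the same amplicon region.
    run qiime dada2 denoise-paired \
        --i-demultiplexed-seqs "demux_${r}.qza" \
        --p-trim-left-f 5 \
        --p-trim-left-r 5 \
        --p-trunc-len-f 295 \
        --p-trunc-len-r 274 \
        --p-n-threads 0 \
        --o-representative-sequences "rep-seqs_${r}.qza" \
        --o-table "table_${r}.qza" \
        --o-denoising-stats "stats_${r}.qza"
    # Collect the per-run outputs for the merge steps.
    table_args="$table_args --i-tables table_${r}.qza"
    seq_args="$seq_args --i-data rep-seqs_${r}.qza"
done

# Word splitting of $table_args and $seq_args is intentional here.
run qiime feature-table merge $table_args --o-merged-table table_merged.qza
run qiime feature-table merge-seqs $seq_args --o-merged-data rep-seqs_merged.qza
```

Denoising per run lets DADA2 learn one error model per sequencing run, and the two merge commands then combine the feature tables and representative sequences into single artifacts for downstream steps.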


Thank you very much for your immediate reply!
Indeed my colleague’s data is probably (much) less complex, as he is looking at the apple microbiome while I have soil, water, and plant samples in my dataset.
Concerning the error model, what was taking very long in my case was actually the fourth step (chimera removal). So maybe there’s some difficulty in chimera removal rather than error modeling when one has several (complex) runs together?
Finally, I conclude from your reply that the number of sockets isn’t known to slow down the process considerably. So having either
6 sockets with 2 cores each or
2 sockets with 6 cores each
shouldn’t make a difference; the total number of cores is what matters. Do you think that’s right?

dada2 denoise-paired on my 15 GB of data finally completed. So, for the record, it took almost 11 days with 12 cores and 60 GB of RAM. The distribution of cores across sockets didn’t seem to play a role. It turned out that quite a lot of the merged reads were removed as chimeric, which is probably why it took that long. The sample that lost the most reads in the whole denoising process ended up with roughly 2% of its initial input reads.
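For anyone wanting to compare machines the same way, here is a small sketch for checking how many CPUs a process can use and how they are laid out across sockets. It assumes a Linux system with `nproc` (GNU coreutils); `lscpu` is used only if it is installed, and its field labels can differ by locale.

```shell
#!/bin/sh
# Number of logical CPUs available to this process.
cpus=$(nproc)
echo "usable CPUs: $cpus"

# Socket/core layout, printed only if lscpu is installed.
if command -v lscpu >/dev/null 2>&1; then
    lscpu | grep -E '^(CPU\(s\)|Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core):' || true
fi
```

Running the same commands inside the container should show what the container itself can see, which is a quick sanity check before a long job with `--p-n-threads 0`.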