I'm running dada2 denoise and it seems to be taking a very long time (>24 hrs so far). I'm aware that dada2 is known for being slow, but I wanted to check that what I'm observing is reasonable. I have 15 million reads (2 x 300 bp paired-end) representing 16 samples. Does it make sense that it has taken >24 hrs so far? I'm running on an HPC with 5 threads and 5 GB of memory allocated.
I'm asking because we plan to run many more samples in the future, and I would like to understand what to expect in terms of timing (if we run, say, 100 samples, is it going to take a month to run dada2 with this memory allocation?)
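For reference, this is roughly the shape of the command I'm running (the input/output names and truncation lengths below are placeholders, not my exact values):

```
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 280 \
  --p-trunc-len-r 220 \
  --p-n-threads 5 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --verbose
```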
EDIT: What sequencing depth are others using? And is there a way of subsampling reads with QIIME 2 before running dada2 denoise?
Yeah, that should be fine. You can speed it up with more threads (for example, we'll sometimes use 32 on a dedicated HPC node, and those runs finish within 2-ish days), but I suspect it has already completed by the time of my response?
I think 12-16 GB is a "safe" amount of memory for basically any job with DADA2. It doesn't need a lot, but you might be a little close to a threshold at 5 GB.
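If your cluster happens to use SLURM (just an assumption, since you didn't say which scheduler you're on), the thread and memory requests might look roughly like this:

```
#!/bin/bash
#SBATCH --cpus-per-task=16   # more threads = faster denoising
#SBATCH --mem=16G            # 12-16 GB is a comfortable ceiling for most DADA2 jobs

# thread count should match --cpus-per-task above;
# truncation lengths here are placeholders
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 280 --p-trunc-len-r 220 \
  --p-n-threads 16 \
  --output-dir denoised/
```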
Goodness no! The longest I have seen is a week (prior to some optimizations that were made), and that was with ~10 GB of compressed data on only 12 or so cores.
We don't have a good way to subsample or partition data in QIIME 2; that's something we could really use, though.
Thank you very much, Evan! It completed after 5 days with 15 million PE reads. In the meantime, I used seqtk to subsample 10k and 100k reads from the fastq files to get an idea of the timing.
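In case it's useful to anyone else, the subsampling itself is a one-liner per fastq (file names here are placeholders); the one thing to watch is passing the same seed (-s) for both files of a pair so the mates stay in sync:

```
# take a random 10k-read subsample from each end of a pair;
# the identical seed keeps R1/R2 matched up
seqtk sample -s100 sample1_R1.fastq.gz 10000 > sub10k_R1.fastq
seqtk sample -s100 sample1_R2.fastq.gz 10000 > sub10k_R2.fastq
```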