dada2 denoise-paired never finishes; too many samples or input too large?

I am running dada2 denoise-paired on our HPC using 32 cores with 256 GB of memory and 2.9 TB in /scratch.
My input file contains 146 samples and the allSamples-paired-end.qza file is 61.7 GB.

I have run this command for over 8 days (199 hours) and got no output. There wasn't even any error message to look at.

 qiime dada2 denoise-paired \
 --i-demultiplexed-seqs allSamples-paired-end.qza \
 --p-trim-left-f 1 \
 --p-trim-left-r 1 \
 --p-trunc-len-f 298 \
 --p-trunc-len-r 299 \
 --p-min-fold-parent-over-abundance 4 \
 --p-max-ee-f 4 \
 --p-max-ee-r 6 \
 --p-n-threads 20 \
 --o-representative-sequences rep_seqs_dada2.qza \
 --o-table feature-table-dada2.qza \
 --o-denoising-stats denoising-stats-dada2.qza \
 --verbose

Should I be splitting my samples into smaller groups, as suggested in the topic "DADA2 takes too much time for analysis"?

Hi!
Splitting your dataset looks like a reasonable approach to me. If your samples were sequenced in different runs, then it is not only reasonable but also recommended. You can also separate them by lane, if this information is available. Make sure to run all batches with exactly the same parameters so that you can merge the output files afterwards.
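
As a rough sketch of what that could look like: the file names (demux-run1.qza, table-run1.qza and so on) and the QIIME dry-run variable are made up for illustration, and the parameter values are simply copied from the command in the first post, not a recommendation. Merging uses qiime feature-table merge and qiime feature-table merge-seqs.

```shell
# Sketch only. File names such as demux-<batch>.qza are hypothetical.
# QIIME defaults to the real CLI; set QIIME=echo to print the commands
# instead of running them.

# Denoise one batch. Parameter values are copied from the first post so
# that every batch is processed identically.
denoise_batch() {
  batch=$1
  ${QIIME:-qiime} dada2 denoise-paired \
    --i-demultiplexed-seqs "demux-$batch.qza" \
    --p-trim-left-f 1 \
    --p-trim-left-r 1 \
    --p-trunc-len-f 298 \
    --p-trunc-len-r 299 \
    --p-min-fold-parent-over-abundance 4 \
    --p-max-ee-f 4 \
    --p-max-ee-r 6 \
    --p-n-threads 20 \
    --o-representative-sequences "rep-seqs-$batch.qza" \
    --o-table "table-$batch.qza" \
    --o-denoising-stats "stats-$batch.qza"
}

# Merge the per-batch feature tables and representative sequences.
# Assumes batch names contain no spaces.
merge_batches() {
  tables=""
  seqs=""
  for b in "$@"; do
    tables="$tables --i-tables table-$b.qza"
    seqs="$seqs --i-data rep-seqs-$b.qza"
  done
  # $tables and $seqs are left unquoted on purpose so that each option
  # and file name becomes its own argument.
  ${QIIME:-qiime} feature-table merge $tables \
    --o-merged-table table-merged.qza
  ${QIIME:-qiime} feature-table merge-seqs $seqs \
    --o-merged-data rep-seqs-merged.qza
}

# Example use:
#   denoise_batch run1 && denoise_batch run2 && merge_batches run1 run2
```

Running it once with QIIME=echo is a cheap way to check the generated commands before submitting a long job.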

Best,

Hi @timanix! I'm just hopping onto this thread because I'm having a similar issue, but my dataset is on the smaller side (I think?). I have 160 samples of bacterial DNA sequences that are about 300 bp each, and the files are about 10-16 MB each. However, the denoising step with dada2 has never finished for me. I've tried restarting a couple of times and have also tried using an external hard drive in case the problem was my computer not having enough space. I actually just tried to batch the samples and denoise with the same parameters for each batch, but even with smaller batches (5 samples per batch) it doesn't seem to work. Is this a common problem or likely just an issue with my own computer? Thanks so much in advance for your time!

Hello!
How much RAM do you have on your machine? 10-16 MB per sample: are those *.gz files? If yes, it looks like you have a lot of reads per sample. DADA2 requires a lot of RAM, and if your samples contain a lot of reads that may be the issue. Recently I worked with a dataset with 1M reads per sample.
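
If you are not sure how many reads you have, a quick way to check is to count lines in the compressed FASTQ files. This is a minimal sketch: count_reads is a made-up helper name, and it assumes standard 4-line FASTQ records.

```shell
# Print the number of reads in a gzip-compressed FASTQ file.
# Assumes every record spans exactly 4 lines (standard FASTQ).
count_reads() {
  gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Example use, printing one count per file in the current directory:
#   for f in *.fastq.gz; do printf '%s\t' "$f"; count_reads "$f"; done
```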

Out of curiosity - did you try with one sample?

If dividing the samples into batches does not work and your samples have too many reads, try subsampling the sequences (there is a plugin for demultiplexed samples).
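
For paired-end data, subsampling can be done with qiime demux subsample-paired. A minimal sketch follows; the wrapper function, the file names and the fraction in the example are placeholders I picked, not recommended values.

```shell
# Sketch only. subsample_demux is a hypothetical wrapper.
# QIIME defaults to the real CLI; set QIIME=echo to print the command
# instead of running it.
subsample_demux() {
  in_qza=$1     # demultiplexed paired-end artifact
  fraction=$2   # fraction of reads to keep per sample, e.g. 0.2
  out_qza=$3    # where to write the subsampled artifact
  ${QIIME:-qiime} demux subsample-paired \
    --i-sequences "$in_qza" \
    --p-fraction "$fraction" \
    --o-subsampled-sequences "$out_qza"
}

# Example use:
#   subsample_demux demux.qza 0.2 demux-subsampled.qza
```

If the subsampled data denoises in reasonable time, that points to read depth, rather than the number of samples, as the bottleneck.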

Best,

To follow up on this: I split my samples into smaller groups, and these smaller groups also never finished. I was going to try running each sample separately, but this portion of the project was stopped, so I won't have the opportunity to see whether that would have completed.

Hopefully you are able to try it and can report back if you were successful.
