DADA2 takes too much time for analysis

I am running DADA2 with about 1,200 fastq files, which is 20GB after import.

Even allowing for the size of the data, DADA2 is taking far too long, and it is actually still running.

What makes it worse is that I previously got results for a set of 1,150 samples, which is also fairly large. That run took about 3 days with 75 threads and the same settings.

However, after I added 50 more samples to that set of 1,150, the run had not finished even after 10 days.

nohup qiime dada2 denoise-paired --i-demultiplexed-seqs 1-demux-all.qza --p-trunc-len-f 270 --p-trunc-len-r 210 --o-table 2-table.qza --o-representative-sequences 2-rep-seq.qza --o-denoising-stats 2-stat.qza --p-n-threads 75 &

I tried importing and running again, but nothing seems to change.

Below is an 'htop' capture.

Is this a problem with the server? I noticed that several threads had gone into the sleeping state (S in htop)...

Any advice would be very helpful. Thank you in advance.

Hi @jwhuh,

The typical recommendation for DADA2 is to run it in batches by sequencing run. It's not designed to handle 1200 fastq files at once. Without logs, it's hard to tell where things fail. I would try checking your tmp directory and maybe emptying that.

But, my ultimate recommendations would be:

  1. Split your data into sequencing runs (if you know the sequencing runs) and process on a per-run basis. Denoise everything with the same parameters. Merge the feature table and representative sequences once you're done denoising. A workflow manager or just a shell script can be helpful in running this on an HPC.
  2. Split your data into artificial runs if you don't know the original sequencing runs. Again, denoise with the same parameters and merge once you're done.
  3. Use Deblur, which is a more conservative algorithm but doesn't have the error learning step and can therefore be quicker on larger sample sets.
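
For options 1 and 2, a minimal shell sketch of the per-run workflow might look like the following. The run names (run1, run2, run3) and the per-run file names (demux-<run>.qza, table-<run>.qza, and so on) are placeholders I made up for illustration, and the dry-run switch is just a convenience of this sketch, not a QIIME 2 feature. The truncation lengths, thread count and final output names are taken from the command in the original post. Please check the options against the help text of your installed QIIME 2 version before running it.

```shell
#!/bin/sh
# Sketch only: denoise each sequencing run separately, then merge the results.
# Assumption: one imported artifact per run, named demux-<run>.qza.
set -eu

RUNS="${RUNS:-run1 run2 run3}"   # hypothetical run names, space-separated
THREADS="${THREADS:-75}"         # thread count from the original command
DRY_RUN="${DRY_RUN:-1}"          # 1 = only print commands, 0 = execute them

# Print the command; execute it only when DRY_RUN=0.
run() {
  printf '%s\n' "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Denoise every run with identical parameters so the ASVs are comparable.
denoise_all() {
  for r in $RUNS; do   # unquoted on purpose: split on spaces
    run qiime dada2 denoise-paired \
      --i-demultiplexed-seqs "demux-$r.qza" \
      --p-trunc-len-f 270 --p-trunc-len-r 210 \
      --p-n-threads "$THREADS" \
      --o-table "table-$r.qza" \
      --o-representative-sequences "rep-seq-$r.qza" \
      --o-denoising-stats "stat-$r.qza"
  done
}

# Merge the per-run feature tables and representative sequences.
merge_all() {
  set --   # build the argument list in the positional parameters
  for r in $RUNS; do set -- "$@" --i-tables "table-$r.qza"; done
  run qiime feature-table merge "$@" --o-merged-table 2-table.qza

  set --
  for r in $RUNS; do set -- "$@" --i-data "rep-seq-$r.qza"; done
  run qiime feature-table merge-seqs "$@" --o-merged-data 2-rep-seq.qza
}

denoise_all
merge_all
```

As written, the script only prints the commands it would run. Setting DRY_RUN=0 (and RUNS to your real run names) executes them one run at a time. On an HPC you would more likely submit each denoising step as its own job and run the merge afterwards.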



I'm late to this thread, and I wanted to 'qiime in' to explain why this works:

With OTUs, the other sequences included during clustering changed the results. Meaning...

  • You have to include all data during clustering to get the best OTUs
  • You can't consistently merge OTU tables later

With Amplicon Sequence Variants (ASVs), like those DADA2 produces, results are reliable (ideally deterministic!) sample-to-sample and run-to-run. So merging ASV tables is now possible.

In fact, DADA2 builds an error model for each sequencing run. So this should work better!


Thank you for your kind recommendations.
The data is already split by sequencing run, so I could follow your answer right away.
I had wondered whether running DADA2 on the same samples in separate runs would give an identical ASV abundance table. Thanks to @colinbrislawn's explanation, I can go ahead with that approach.

I appreciate your help
