Speeding up dada2 denoise-paired

Hi,

I have a question about the dada2 denoise-paired command.
I'm working with a large 16S dataset (1138 fastq files, 58 GB). When I run this command, it takes more than 10 days. I run it as a job on a computer cluster that does not allow jobs longer than 10 days, so my job gets killed. I ran the exact same command on a smaller dataset (940 files, 42 GB), and it finished smoothly in ~20 hours using 8 G. I don't understand why the larger dataset takes so much longer. Is there any way to speed up this analysis so that it finishes within 10 days on the larger dataset as well?

The command I used:
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trim-left-f 20 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 280 \
  --p-trunc-len-r 215 \
  --p-pooling-method 'pseudo' \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

I'm using QIIME2/2023.7.0 on a computer cluster. Let me know if you need more information to help.

Many thanks in advance!

Hi @emma1,

Welcome to the QIIME 2 Forum!

You have a lot of samples! Do you know whether they all come from the same sequencing run? DADA2 estimates its error model on the assumption that all samples in a job come from a single run, so if your FASTQ files come from different runs, you should run DADA2 once per run and then merge the resulting feature tables if you wish (source).
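If the files do turn out to come from several runs, the per-run workflow could look roughly like the sketch below. The run labels (run1, run2) and the demux-<run>.qza file names are hypothetical, and the first line is only a dry-run fallback that prints each command when qiime is not on the PATH. The same trimming and truncation values are used for every run so that the sequences stay comparable after merging.

```shell
# Dry-run fallback (illustrative only): if qiime is not installed,
# define a stand-in function that prints the command instead of running it.
command -v qiime >/dev/null 2>&1 || qiime() { echo "would run: qiime $*"; }

# Hypothetical run labels; each run needs its own demultiplexed artifact,
# assumed here to be named demux-run1.qza, demux-run2.qza, ...
RUNS="run1 run2"
TABLE_ARGS=""
SEQ_ARGS=""

for run in $RUNS; do
  # Denoise each sequencing run separately, with identical parameters.
  qiime dada2 denoise-paired \
    --i-demultiplexed-seqs "demux-$run.qza" \
    --p-trim-left-f 20 \
    --p-trim-left-r 20 \
    --p-trunc-len-f 280 \
    --p-trunc-len-r 215 \
    --p-pooling-method pseudo \
    --o-representative-sequences "rep-seqs-$run.qza" \
    --o-table "table-$run.qza" \
    --o-denoising-stats "stats-$run.qza"

  # Collect the per-run outputs for the merge step below.
  TABLE_ARGS="$TABLE_ARGS --i-tables table-$run.qza"
  SEQ_ARGS="$SEQ_ARGS --i-data rep-seqs-$run.qza"
done

# The *_ARGS variables are left unquoted on purpose so that they split
# into separate arguments (this assumes file names without spaces).
qiime feature-table merge $TABLE_ARGS --o-merged-table table-merged.qza
qiime feature-table merge-seqs $SEQ_ARGS --o-merged-data rep-seqs-merged.qza
```

A side effect of splitting by run is that each DADA2 job is smaller, so the per-run jobs can be submitted to the cluster separately and are less likely to hit the time limit.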

That being said, here are some suggestions for speeding up the process, ordered by what I would try first:

  • Have you tried multithreading? Check out the --p-n-threads option (docs here). It defaults to 1, and a value of 0 uses all available cores.
  • --p-pooling-method independent is the default and should be faster than pseudo (although people on the forum suggest that pseudo is also a sensible option, e.g. see here).
  • Is it possible that DADA2 is hanging because of insufficient RAM and your cluster is not killing the job? That would explain why the time limit is reached. The cluster's output and error files would be useful for investigating this.
  • Most clusters have short and long job queues, and some also have something like a "no limits" queue that you can use under special conditions if you contact the cluster maintainers.
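As a rough sketch, the first two suggestions could be combined with the original command like this. The thread count of 8 is a placeholder and should match the number of cores the job requests from the scheduler. The first line is only a dry-run fallback that prints the command when qiime is not on the PATH; it is not part of QIIME 2.

```shell
# Dry-run fallback (illustrative only): if qiime is not installed,
# define a stand-in function that prints the command instead of running it.
command -v qiime >/dev/null 2>&1 || qiime() { echo "would run: qiime $*"; }

# Placeholder value: set this to the number of cores your job requests.
# Passing 0 instead tells DADA2 to use all available cores.
THREADS=8

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trim-left-f 20 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 280 \
  --p-trunc-len-r 215 \
  --p-pooling-method independent \
  --p-n-threads "$THREADS" \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza
```

Extra threads only help if the scheduler actually grants the job that many cores, so the job submission should request the same number.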

Best,

Sergio
