DADA2 denoising taking a long time

Issue:

I'm experiencing an extremely long runtime with the denoising (dada2) step in my QIIME2 pipeline. The process has been running for 10 days now with no completion in sight, and I'm trying to figure out what might be going wrong.

  • QIIME2 version: 2024.10.1

Dataset Information

  • Number of samples: 152 (16S and CO1)

  • Read type: paired-end

  • Demux file size: 14.9 GB

Command Used

#!/bin/bash
#PBS -N denoise_co1-RP
#PBS -q cpu-q
#PBS -l nodes=1:ppn=30
#PBS -l mem=120gb
#PBS -V
#PBS -j oe
#PBS -o /pfs/home/ruba/projects/qiime2-amplicon/animal_diet/co1/dada2_denoising/trial1/denoising.log

# INPUT_QZA, OUTPUT_DIR, and LOGFILE must be defined before this point
# (their definitions are not shown in this excerpt)

# Truncation settings (edit based on the demux .qzv quality plots)
TRUNC_LEN_F=180   # truncate forward reads at this position
TRUNC_LEN_R=225   # truncate reverse reads at this position

echo "Using trunc-len-f: $TRUNC_LEN_F, trunc-len-r: $TRUNC_LEN_R" >> "$LOGFILE"

# Run DADA2
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs "$INPUT_QZA" \
  --p-trunc-len-f $TRUNC_LEN_F \
  --p-trunc-len-r $TRUNC_LEN_R \
  --p-n-threads 0 \
  --o-table "$OUTPUT_DIR"/table.qza \
  --o-representative-sequences "$OUTPUT_DIR"/rep-seqs.qza \
  --o-denoising-stats "$OUTPUT_DIR"/stats.qza \
  --verbose >> "$LOGFILE" 2>&1

Current System Status (top command):

[screenshot of top output omitted]

Log File Contents:

DADA2 started at: Mon Nov 17 11:10:12 IST 2025
Using trunc-len-f: 180, trunc-len-r: 225
R version 4.3.3 (2024-02-29)
Loading required package: Rcpp
DADA2: 1.30.0 / Rcpp: 1.0.13.1 / RcppParallel: 5.1.9
2) Filtering ............................................................................
3) Learning Error Rates
347333760 total bases in 1929632 reads from 2 samples will be used for learning the error rates.
434167200 total bases in 1929632 reads from 2 samples will be used for learning the error rates.
3) Denoise samples ............................................................................
............................................................................
5) Remove chimeras (method = consensus)

Could anyone help me figure out what is going wrong, where things are getting stuck, and how to solve this problem? Any guidance would be greatly appreciated. Thank you!

Hello!

Looks like your samples were sequenced at extremely high depth.

That works out to ~100 MB per sample (for 16S I usually get 1-8 MB at sufficient sequencing depth), so I'd guess you have millions of reads in each, and DADA2 simply takes a long time to process them. Are you running everything together?

If you decide to abort the current run, you can try:

  • Splitting your dataset into batches (16S separately, CO1 separately); see the sketch after this list
  • Splitting each subset further, especially if you have samples from different sequencing runs
  • Subsampling each sample to a fraction of its total depth
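
One way to split the demultiplexed artifact by marker is qiime demux filter-samples with a metadata query. A minimal sketch, assuming your sample metadata has a column (here called target-gene) recording which marker each sample belongs to; the column name and file names are placeholders:

# Keep only the CO1 samples (hypothetical 'target-gene' metadata column)
qiime demux filter-samples \
  --i-demux demux.qza \
  --m-metadata-file metadata.tsv \
  --p-where "[target-gene]='CO1'" \
  --o-filtered-demux demux-co1.qza

# And likewise for the 16S samples
qiime demux filter-samples \
  --i-demux demux.qza \
  --m-metadata-file metadata.tsv \
  --p-where "[target-gene]='16S'" \
  --o-filtered-demux demux-16s.qza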

If you decide to split the dataset into subsets, make sure that the subsets you plan to merge later are run with identical parameters during the primer-removal and DADA2 steps; otherwise there will be artificial clustering of the samples driven by the different trimming/truncation settings. That way you can launch each subset in parallel on the HPC, then merge the feature tables and rep-seqs files, as shown below.
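
The merging itself is done with the feature-table plugin. A sketch with placeholder file names for two batches:

# Merge the per-batch feature tables into one
qiime feature-table merge \
  --i-tables table-batch1.qza \
  --i-tables table-batch2.qza \
  --o-merged-table table-merged.qza

# Merge the per-batch representative sequences
qiime feature-table merge-seqs \
  --i-data rep-seqs-batch1.qza \
  --i-data rep-seqs-batch2.qza \
  --o-merged-data rep-seqs-merged.qza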

Hope that helps.


It is amplicon sequencing data for the CO1 marker, 142 samples in total, and I'm running them all together. The data is not from different sequencing runs.


Then I would either:

  • wait until it finishes,
  • split the samples into batches and run DADA2 with identical settings, merging the results later,
  • or subsample each sample to a fraction of its total number of reads (see the sketch after this list).
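
For the subsampling option, q2-demux has a subsampling action that randomly keeps a fraction of the reads in every sample. A minimal sketch, keeping 25% of reads (the fraction and file names are placeholders to adjust):

# Randomly retain ~25% of the reads in each sample
qiime demux subsample-paired \
  --i-sequences demux.qza \
  --p-fraction 0.25 \
  --o-subsampled-sequences demux-subsampled.qza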

Can you suggest how I can do this? Since my data is not from different sequencing runs, what would be the ideal way to split it into batches?

I am not familiar with your experimental design. Ideally, you would separate the samples by groups that you are not going to compare. For example, I did something similar with a very large dataset: I had several treatment groups and several groups based on the GIT section or body site. Since it was already well known that GIT sections are very different from each other, I split my dataset by section and compared the treatment groups within each section separately.

If that is not possible and you only have treatment groups whose samples all need to be compared together, I would make sure that each group is equally represented (at most a one-sample difference) in every batch. Keep the batch assignment in the metadata to check whether you get a strong batch effect after splitting; you can then pull out each batch with a metadata query, as in the sketch below.
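
Once the batches are assigned in the metadata, each one can be extracted the same way as above. A sketch, assuming a hypothetical batch column in your metadata file:

# Extract the samples assigned to batch 'A' (hypothetical 'batch' metadata column)
qiime demux filter-samples \
  --i-demux demux.qza \
  --m-metadata-file metadata.tsv \
  --p-where "[batch]='A'" \
  --o-filtered-demux demux-batchA.qza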

Best,

Thank you so much for the reply. Will try this.