DADA2 denoising taking a long time

Issue:

I'm experiencing an extremely long runtime with the denoising (dada2) step in my QIIME2 pipeline. The process has been running for 10 days now with no completion in sight, and I'm trying to figure out what might be going wrong.

  • QIIME2 version: 2024.10.1

Dataset Information

  • Number of samples: 152 (16S and CO1)

  • Read type: paired-end

  • Demux file size: 14.9 GB

Command Used

#!/bin/bash
#PBS -N denoise_co1-RP
#PBS -q cpu-q
#PBS -l nodes=1:ppn=30
#PBS -l mem=120gb
#PBS -V
#PBS -j oe
#PBS -o /pfs/home/ruba/projects/qiime2-amplicon/animal_diet/co1/dada2_denoising/trial1/denoising.log

# INPUT_QZA, OUTPUT_DIR, and LOGFILE must be defined before this point
# (their definitions are not shown in this excerpt)

# Truncation settings (edit based on the demux .qzv quality plots)
TRUNC_LEN_F=180   # truncate forward reads at this position
TRUNC_LEN_R=225   # truncate reverse reads at this position

echo "Using trunc-len-f: $TRUNC_LEN_F, trunc-len-r: $TRUNC_LEN_R" >> "$LOGFILE"

# Run DADA2
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs "$INPUT_QZA" \
  --p-trunc-len-f $TRUNC_LEN_F \
  --p-trunc-len-r $TRUNC_LEN_R \
  --p-n-threads 0 \
  --o-table "$OUTPUT_DIR"/table.qza \
  --o-representative-sequences "$OUTPUT_DIR"/rep-seqs.qza \
  --o-denoising-stats "$OUTPUT_DIR"/stats.qza \
  --verbose >> "$LOGFILE" 2>&1

Current System Status (top command):

[screenshot of top output omitted]

Log File Contents:

DADA2 started at: Mon Nov 17 11:10:12 IST 2025
Using trunc-len-f: 180, trunc-len-r: 225
R version 4.3.3 (2024-02-29)
Loading required package: Rcpp
DADA2: 1.30.0 / Rcpp: 1.0.13.1 / RcppParallel: 5.1.9
2) Filtering ............................................................................
3) Learning Error Rates
347333760 total bases in 1929632 reads from 2 samples will be used for learning the error rates.
434167200 total bases in 1929632 reads from 2 samples will be used for learning the error rates.
3) Denoise samples ............................................................................
............................................................................
5) Remove chimeras (method = consensus)

Could anyone help me figure out what is going wrong, where things are getting stuck, and how to solve this problem? Any guidance would be greatly appreciated. Thank you!

Hello!

Looks like your samples were sequenced at extremely high depth.

That works out to ~100 MB per sample (for 16S I usually get 1-8 MB at sufficient sequencing depth), so I'd guess you have millions of reads in each, and DADA2 simply takes a long time to process them. Are you running everything together?

If you decide to abort the current run, you can try:

  • Splitting your dataset into batches (16S separately, CO1 separately); see the sketch after this list
  • Splitting each subset further, especially if you have samples from different sequencing runs
  • Subsampling each sample to a fraction of its total depth
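
One way to split the demultiplexed artifact by marker is qiime demux filter-samples with a metadata query. A minimal sketch, assuming your sample metadata has a column (here called target-gene) recording which marker each sample belongs to; the column name and file names are placeholders:

# Keep only the CO1 samples (hypothetical 'target-gene' metadata column)
qiime demux filter-samples \
  --i-demux demux.qza \
  --m-metadata-file metadata.tsv \
  --p-where "[target-gene]='CO1'" \
  --o-filtered-demux demux-co1.qza

# And likewise for the 16S samples
qiime demux filter-samples \
  --i-demux demux.qza \
  --m-metadata-file metadata.tsv \
  --p-where "[target-gene]='16S'" \
  --o-filtered-demux demux-16s.qza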

If you decide to split the dataset into subsets, make sure that the subsets you plan to merge later are run with identical parameters during the primer-removal and DADA2 steps; otherwise there will be artificial clustering of the samples driven by the different trimming/truncation settings. That way you can launch each subset in parallel on the HPC, then merge the feature tables and rep-seqs files, as shown below.
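
The merging itself is done with the feature-table plugin. A sketch with placeholder file names for two batches:

# Merge the per-batch feature tables into one
qiime feature-table merge \
  --i-tables table-batch1.qza \
  --i-tables table-batch2.qza \
  --o-merged-table table-merged.qza

# Merge the per-batch representative sequences
qiime feature-table merge-seqs \
  --i-data rep-seqs-batch1.qza \
  --i-data rep-seqs-batch2.qza \
  --o-merged-data rep-seqs-merged.qza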

Hope that helps.


It is amplicon sequencing data for the CO1 marker, 142 samples in total, and I'm running them all together. The data is not from different sequencing runs.


Then I would either:

  • wait until it finishes,
  • split the samples into batches and run DADA2 with identical settings, merging the results later,
  • or subsample each sample to a fraction of its total number of reads (see the sketch after this list).
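
For the subsampling option, q2-demux has a subsampling action that randomly keeps a fraction of the reads in every sample. A minimal sketch, keeping 25% of reads (the fraction and file names are placeholders to adjust):

# Randomly retain ~25% of the reads in each sample
qiime demux subsample-paired \
  --i-sequences demux.qza \
  --p-fraction 0.25 \
  --o-subsampled-sequences demux-subsampled.qza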

Can you suggest how I can do this? Since my data is not from different sequencing runs, what would be the ideal way to split it into batches?

I am not familiar with your experimental design. Ideally, you would separate the samples by groups that you are not going to compare. For example, I did something similar with a very large dataset: I had several treatment groups and several groups based on the GIT section or body site. Since it was already well known that GIT sections are very different from each other, I split my dataset by section and compared the treatment groups within each section separately.

If that is not possible and you only have treatment groups whose samples all need to be compared together, I would make sure that each group is equally represented (at most a one-sample difference) in every batch. Keep the batch assignment in the metadata to check whether you get a strong batch effect after splitting; you can then pull out each batch with a metadata query, as in the sketch below.
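
Once the batches are assigned in the metadata, each one can be extracted the same way as above. A sketch, assuming a hypothetical batch column in your metadata file:

# Extract the samples assigned to batch 'A' (hypothetical 'batch' metadata column)
qiime demux filter-samples \
  --i-demux demux.qza \
  --m-metadata-file metadata.tsv \
  --p-where "[batch]='A'" \
  --o-filtered-demux demux-batchA.qza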

Best,

Thank you so much for the reply. Will try this.