I end up with only 3% of my data after denoising with dada2

Silke_Lambert · February 14, 2025, 4:29pm

Hi all,

I have human nose swab samples. My samples were pooled before sequencing (16S and 18S together). My data are fastq.gz samples (R1 and R2) for each of my samples.

I do the following:

remove my adapters (using cutadapt)
split my data in 16S and 18S based on the primers and remove the primers (with cutadapt)
using a manifest file I create .qza files with my data
following I look at my quality and decide to trunc. I only trunc my reverse reads of the 16S as the others look ok (see picture).
When I look at my denoising statistics, I see that I only keep around 3% of my input and I don't understand why (see picture). Can someone help me?

Thanks a lot!

ebolyen · February 14, 2025, 4:31pm

Hi @Silke_Lambert,

Your data looks great, so I'm not sure why this is happening either.

It looks like the vast majority of the dropout happens at the filtering step, which is odd since your reads are plenty long and have good quality.

Could you provide your specific DADA2 commands for each amplicon?

Thanks!

Silke_Lambert · February 14, 2025, 5:00pm

adapter removing:

FWD_ADAPTER="TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG"
REV_ADAPTER="GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG"

# Zoek alle forward reads (_R1.fastq.gz) in RUN205 en submappen
find "$INPUT_DIR" -type f -name "*_R1.fastq.gz" | while read FORWARD; do
    # Vind het bijbehorende reverse read-bestand (_R2.fastq.gz)
    REVERSE="${FORWARD/_R1.fastq.gz/_R2.fastq.gz}"

    # Controleer of het reverse bestand bestaat
    if [ -f "$REVERSE" ]; then
        BASENAME=$(basename "$FORWARD" _R1.fastq.gz)

        echo "Verwerken: $BASENAME"

        # Cutadapt uitvoeren voor adapter trimming
        cutadapt \
          -a "$FWD_ADAPTER" -A "$REV_ADAPTER" \
          -o "$OUTPUT_DIR/${BASENAME}_trimmed_R1.fastq.gz" \
          -p "$OUTPUT_DIR/${BASENAME}_trimmed_R2.fastq.gz" \
          "$FORWARD" "$REVERSE"

        echo "Klaar: $BASENAME"

    else
        echo "Geen bijbehorende R2 gevonden voor $FORWARD! Overslaan..."
    fi
done

splitting:

find "$INPUT_DIR" -type f -name "*R1.fastq.gz" | while read FORWARD; do
    # Vind het corresponderende reverse bestand
    REVERSE="${FORWARD/R1.fastq.gz/R2.fastq.gz}"

    # Controleer of het reverse bestand bestaat
    if [ -f "$REVERSE" ]; then
        # Extracteer de basisnaam zonder subpad
        BASENAME=$(basename "$FORWARD" _R1.fastq.gz | sed 's/_trimmed//')

        echo "Processing $BASENAME..."

        # Run Cutadapt voor 16S
        cutadapt \
          -g "CCTACGGGNGGCWGCAG" \
          -G "GACTACHVGGGTATCTAATCC" \
          -o "$OUTPUT_DIR/${BASENAME}_16S_R1.fastq.gz" \
          -p "$OUTPUT_DIR/${BASENAME}_16S_R2.fastq.gz" \
          "$FORWARD" "$REVERSE"

        # Run Cutadapt voor 18S
        cutadapt \
          -g "CAGCAGCCGCGGTAATTCC" \
          -G "CCCGTGTTGAGTCAAATTAAGC" \
          -o "$OUTPUT_DIR/${BASENAME}_18S_R1.fastq.gz" \
          -p "$OUTPUT_DIR/${BASENAME}_18S_R2.fastq.gz" \
          "$FORWARD" "$REVERSE"
    else
        echo "Waarschuwing: Geen bijbehorende R2 voor $FORWARD gevonden!"
    fi
done

create .qza file:
qiime tools import
--type 'SampleData[PairedEndSequencesWithQuality]'
--input-path manifest_16S.tsv
--output-path paired-end-demux-16S.qza
--input-format PairedEndFastqManifestPhred33V2

Plaat 2: Sequenties importeren

qiime tools import
--type 'SampleData[PairedEndSequencesWithQuality]'
--input-path manifest_18S.tsv
--output-path paired-end-demux-18S.qza
--input-format PairedEndFastqManifestPhred33V2

look at the quality:
qiime demux summarize
--i-data paired-end-demux-16S.qza
--o-visualization demux_16S.qzv

qiime demux summarize
--i-data paired-end-demux-18S.qza
--o-visualization demux_18S.qzv

denoising:

Denoising usando DADA2

qiime dada2 denoise-paired
--i-demultiplexed-seqs paired-end-demux-16S.qza
--p-trunc-len-f 0
--p-trim-left-f 0
--p-trunc-len-r 280
--p-trim-left-r 0
--o-table dada2_16S_table.qza
--o-representative-sequences dada2_rep-seqs16S.qza
--o-denoising-stats dada2_stats16S.qza

and for the 18S:

Denoising usando DADA2

qiime dada2 denoise-paired
--i-demultiplexed-seqs paired-end-demux-18S.qza
--p-trunc-len-f 0
--p-trim-left-f 0
--p-trunc-len-r 0
--p-trim-left-r 0
--o-table dada2_18S_table.qza
--o-representative-sequences dada2_rep-seqs18S.qza
--o-denoising-stats dada2_stats18S.qza

visualize:
#Denoising stats visualization
qiime metadata tabulate
--m-input-file dada2_stats18S.qza
--o-visualization dada2_stats18S.qzv

ebolyen · February 14, 2025, 5:09pm

Hey @Silke_Lambert,

This is basically exactly what I would have done given those quality plots, so at this point all I can recommend is setting your trunc-len a bit more aggressively to see if we can get through filtering.

You might try 280 on all of them just to see? I don't love this because again, your quality scores look great.

Otherwise, we need to consider tuning parameters like max-ee.

timanix · February 15, 2025, 5:18pm

Hello!
I had a dataset with similar issue and similar quality plots (the same drop in the middle). So the workarounds were for me (choose any that works):

Increase max-ee values as @ebolyen suggested
Merge reads with vsearch and use Deblur
Cluster reads into OTUs (100%, 99% or 97%) with vsearch.

benjjneb · February 15, 2025, 6:49pm

My guess is that you are losing the vast majority of your 16S reads due to the interaction between the sequence length after trimming the primers and your reverse truncation length. It looks like you have 300bp sequences, and your reverse 16S primer is 21 nts long, so if it present on the reads and being trimmed off by cutadapt, then the output reads are 279 bps, and so will get thrown away when you set --p-trunc-len-r 280. The truncation length in dada2 both truncates longer reads to the specified length, and throws away all reads that are shorter than the specified length.

Are you alse seeing this in the 18S data, where you aren't truncating?

Silke_Lambert · February 18, 2025, 2:02pm

Thank you, I indeed put zero for my truncing now and I got the following (see picture) for my 16S . I can try as well with 278 instead of 280.

For the 18S, I did the following and I only keep around 12% of my data. I will now also run it with truncing put at zero.

Denoising usando DADA2

qiime dada2 denoise-paired
--i-demultiplexed-seqs paired-end-demux-18S.qza
--p-trunc-len-f 0
--p-trim-left-f 0
--p-trunc-len-r 277
--p-trim-left-r 0
--o-table dada2_18S_table.qza
--o-representative-sequences dada2_rep-seqs18S.qza
--o-denoising-stats dada2_stats18S.qza

system · March 21, 2025, 8:02pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.