High but variable sequence loss (32-68%) during DADA2 filtering with V3-V4 amplicons (MiSeq 600 cycles)

Hi QIIME2 community,

I am processing 16S rRNA data (V3-V4 region, ~464 bp amplicon) sequenced on an Illumina MiSeq using the V3 600-cycle kit (2x276 bp reads). I am seeing a significant and variable drop in sequence counts during the DADA2 filtering step, and I would like to ask whether this suggests a taxonomic bias, or whether it is acceptable given my downstream results.

My Pipeline:

  1. Primer removal (Cutadapt): --p-front-f CCTACGGGNGGCWGCAG --p-front-r GACTACHVGGGTATCTAATCC --p-discard-untrimmed

  2. DADA2 (denoise-paired): --p-trunc-len-f 0 --p-trunc-len-r 201

The Data (see attached table):

  • Filtering Loss: Across 86 samples, the "percentage of input passed filter" varies significantly, ranging from 32% to 58%.

  • Merging Success: Once reads pass the filter, the merging efficiency is very high. For example, in Sample 34: 68,588 (filtered) → 68,322 (denoised) → 68,000 (merged). This indicates that the ~13-20 bp overlap is technically functional for the sequences that survive.

  • Final Throughput: Most samples result in 40,000 to 80,000 non-chimeric reads, which is quite robust.

  • Rarefaction: Alpha rarefaction curves (SRS) for all samples reach a clear plateau, even for those with 65% loss.
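To quantify the "merging is not the problem" point, the per-step retention for Sample 34 can be computed directly from the numbers quoted above (a quick sketch; the step names mirror the denoising-stats columns):

```python
# Per-step retention for Sample 34, using the counts quoted above.
# (The raw input count is not quoted, so we start from the filtered reads.)
counts = {"filtered": 68588, "denoised": 68322, "merged": 68000}

def retention(a, b):
    """Percentage of reads surviving from step a to step b."""
    return 100.0 * counts[b] / counts[a]

print(f"filtered -> denoised: {retention('filtered', 'denoised'):.1f}%")  # 99.6%
print(f"filtered -> merged:   {retention('filtered', 'merged'):.1f}%")    # 99.1%
```

So essentially all of the loss happens at the initial quality filter, not at denoising or merging.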

My Concerns:

  1. Taxonomic Bias: Since V3-V4 amplicons can vary in length (approx. 440-480 bp), am I systematically excluding longer variants by truncating the Reverse read at 201 bp? Could the variation in filtering success (32% vs 58%) between samples be related to their specific bacterial composition rather than just sequencing quality?

  2. The discard-untrimmed impact: My demux summarize shows a minimum sequence length of 164 bp for Reverse reads after Cutadapt. Since I set --p-trunc-len-r 201, is it likely that the interaction between Cutadapt trimming and the DADA2 length requirement is the main driver of this 65% loss?

  3. Methodological Validity: Given that my rarefaction curves plateau and I have a high final read count, is it standard practice to accept this level of data loss, or should I prioritize recovering more reads by loosening the Cutadapt parameters or increasing --p-max-ee-r?

I have attached my DADA2 stats and the sequence length summary for further context. I would appreciate any guidance on whether to "trust" these filtered results or re-optimize the pipeline.

Cutadapt:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences qza/demux-paired-end.qza \
  --p-front-f "CCTACGGGNGGCWGCAG" \
  --p-front-r "GACTACHVGGGTATCTAATCC" \
  --p-discard-untrimmed \
  --o-trimmed-sequences qza/demux-paired-end_noprimers.qza

DADA2:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs qza/demux-paired-end_noprimers.qza \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 201 \
  --o-representative-sequences qza/Seq16_dada_seqs.qza \
  --o-table qza/Seq16_dada_table.qza \
  --o-denoising-stats qza/Seq16_dada_stats.qza

Best regards, Lilian

demux-paired-end_noprimers.qzv (329.7 KB)

Seq13_dada_stats.qzv (1.2 MB)

Hi @Lilian_Tiemi_Inoue ,

No, probably not. Such a taxonomic bias would show up at the merging step, not at the filtering step, and your merging looks really good! You are losing many reads at the filtering stage, but more on that below.

Besides, your reads seem just long enough: 201 + 276 nt should give sufficient length to overlap across your 464 bp amplicon (assuming 464 bp is not the expected length after trimming primers). Merging bias is more of a problem with amplicon targets that have much wider length variation.
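To make that arithmetic concrete, here is a back-of-envelope overlap check (a sketch only; it assumes primer lengths of 17 nt and 21 nt for your two primers, primer-trimmed reads as input to DADA2, and DADA2's default minimum overlap of 12 nt):

```python
FWD_PRIMER = 17   # len("CCTACGGGNGGCWGCAG")
REV_PRIMER = 21   # len("GACTACHVGGGTATCTAATCC")
MIN_OVERLAP = 12  # DADA2 default minimum overlap for merging

def overlap(amplicon_len, fwd_read=276, rev_trunc=201):
    """Expected merge overlap for a given amplicon length (primers included)."""
    insert_len = amplicon_len - FWD_PRIMER - REV_PRIMER  # primer-free region
    fwd_used = fwd_read - FWD_PRIMER  # forward read after primer trim, no truncation
    return fwd_used + rev_trunc - insert_len

for amp in (440, 464, 480):
    print(f"{amp} bp amplicon -> {overlap(amp)} nt overlap, "
          f"merges: {overlap(amp) >= MIN_OVERLAP}")
```

Under these assumptions even the longest ~480 bp variants retain an 18 nt overlap, and merging would only start failing for amplicons longer than ~486 bp (primers included), so systematic exclusion of longer variants at the merge step looks unlikely here.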

I am not completely sure why you are running cutadapt at all. Your reads are not long enough to cover the full V3-V4 domain, so you should not be getting read-through issues (where the read covers the entire domain and so the reverse primer is found in the read). You can just use positional trimming within DADA2 to remove the primers, since they will be the first N nt of each read (unless you have an unusual library prep method).
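If you do skip cutadapt, the positional trimming would look something like this (a sketch only, run on the raw demultiplexed reads; it assumes your primers really are the first 17 / 21 nt of every read, and the truncation value would need re-checking against your quality plot):

```shell
# Sketch: positional primer trimming inside DADA2 instead of cutadapt.
# --p-trim-left-f 17  = len("CCTACGGGNGGCWGCAG")
# --p-trim-left-r 21  = len("GACTACHVGGGTATCTAATCC")
# --p-trunc-len-r 222 leaves ~201 nt of biological sequence after the 21 nt trim,
# matching your current post-cutadapt truncation.
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs qza/demux-paired-end.qza \
  --p-trim-left-f 17 \
  --p-trim-left-r 21 \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 222 \
  --o-representative-sequences qza/Seq16_dada_seqs.qza \
  --o-table qza/Seq16_dada_table.qza \
  --o-denoising-stats qza/Seq16_dada_stats.qza
```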

Don't touch max-ee unless you want to permit more low-quality reads. Your post-filtering sequencing depths are very high, so this should be sufficient.

But now, to answer why you are losing so many reads at filtering: it comes down to this parameter: --p-trunc-len-f 0.

Since you are doing no truncation of the forward reads, you are feeding DADA2 quite a lot of noisy reads (median Q = 21 at position 260); as is typical, the 3' ends of the reads can be quite noisy and should be trimmed where possible. Of course, you need all the length you can get to keep sufficient overlap for merging. This is probably okay: your sequencing depth is very high, and sequencing error should be random (in theory), so it would not introduce a systematic bias (unlike trying to merge reads that are not quite long enough, which systematically loses longer amplicons).

In any case, as a sanity check you could try this: denoise only the forward reads (trimming them a bit to remove the low-quality 3' ends) and compare the taxonomic profiles of the single-end and paired-end feature tables. The paired-end data should have slightly better resolution, but you could compare taxonomic profiles at, e.g., family or genus level to see whether any taxonomic groups appear more abundant with one data type or the other. I suspect any differences will come down to the higher taxonomic resolution of the merged paired-end data, e.g., genus instead of family or species instead of genus, but you will not detect entirely new clades.
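That sanity check could look something like this (a sketch; denoise-single uses only the forward reads of a paired-end artifact, the 260 nt truncation should come from your own quality plot, and the classifier and output file names are placeholders):

```shell
# Sketch: forward-read-only denoising for comparison with the paired-end table.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs qza/demux-paired-end_noprimers.qza \
  --p-trunc-len 260 \
  --o-representative-sequences qza/fwd_only_seqs.qza \
  --o-table qza/fwd_only_table.qza \
  --o-denoising-stats qza/fwd_only_stats.qza

# Classify both representative-sequence sets with the same classifier,
# then compare barplots (or collapsed tables) at family/genus level.
qiime feature-classifier classify-sklearn \
  --i-classifier your-classifier.qza \
  --i-reads qza/fwd_only_seqs.qza \
  --o-classification qza/fwd_only_taxonomy.qza

qiime taxa barplot \
  --i-table qza/fwd_only_table.qza \
  --i-taxonomy qza/fwd_only_taxonomy.qza \
  --m-metadata-file metadata.tsv \
  --o-visualization qza/fwd_only_taxa_barplot.qzv
```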

So in summary I think everything you are doing is fine, though you could probably skip the cutadapt step and trim with dada2 instead (if your primers are actually present in your reads).

Good luck!
