Hi QIIME2 community,
I am processing 16S rRNA data (V3-V4 region, ~464 bp amplicon) sequenced on an Illumina MiSeq using the V3 600-cycle kit (2x276 bp reads). I am seeing a significant and variable drop in the number of sequences during the DADA2 filtering step, and I would like to discuss if this suggests a taxonomic bias or if it is acceptable given my downstream results.
My Pipeline:
-
Primer removal (Cutadapt):
--p-front-f CCTACGGGNGGCWGCAG --p-front-r GACTACHVGGGTATCTAATCC --p-discard-untrimmed -
DADA2 (denoise-paired):
--p-trunc-len-f 0 --p-trunc-len-r 201
The Data (see attached table):
-
Filtering Loss: Across 86 samples, the "percentage of input passed filter" varies significantly, ranging from 32% to 58%.
-
Merging Success: Once reads pass the filter, the merging efficiency is very high. For example, in Sample 34: 68,588 (filtered) 68,322 (denoised) 68,000 (merged). This indicates that the ~13-20 bp overlap is technically functional for the sequences that survive.
-
Final Throughput: Most samples result in 40,000 to 80,000 non-chimeric reads, which is quite robust.
-
Rarefaction: Alpha rarefaction curves (SRS) for all samples reach a clear plateau, even for those with 65% loss.
My Concerns:
-
Taxonomic Bias: Since V3-V4 amplicons can vary in length (approx. 440-480 bp), am I systematically excluding longer variants by truncating the Reverse read at 201 bp? Could the variation in filtering success (32% vs 58%) between samples be related to their specific bacterial composition rather than just sequencing quality?
-
The
discard-untrimmedimpact: Mydemux summarizeshows a minimum sequence length of 164 bp for Reverse reads after Cutadapt. Since I set--p-trunc-len-r 201, is it likely that the interaction between Cutadapt trimming and the DADA2 length requirement is the main driver of this 65% loss? -
Methodological Validity: Given that my rarefaction curves plateau and I have a high final read count, is it standard practice to accept this level of data loss, or should I prioritize recovering more reads by loosening the Cutadapt parameters or increasing
--p-max-ee-r?
I have attached my DADA2 stats and the sequence length summary for further context. I would appreciate any guidance on whether to "trust" these filtered results or re-optimize the pipeline.
Cutadapt
qiime cutadapt trim-paired \
--i-demultiplexed-sequences qza/demux-paired-end.qza \
--p-front-f "CCTACGGGNGGCWGCAG" \
--p-front-r "GACTACHVGGGTATCTAATCC" \
--p-discard-untrimmed \
--o-trimmed-sequences qza/demux-paired-end_noprimers.qza
DADA2:
qiime dada2 denoise-paired \
--i-demultiplexed-seqs qza/demux-paired-end_noprimers.qza \
--p-trunc-len-f 0 \
--p-trunc-len-r 201 \
--o-representative-sequences qza/Seq16_dada_seqs.qza \
--o-table qza/Seq16_dada_table.qza \
--o-denoising-stats qza/Seq16_dada_stats.qza
Best regards, Lilian
demux-paired-end_noprimers.qzv (329.7 KB)
Seq13_dada_stats.qzv (1.2 MB)