DADA2 denoising filtering/merging step removing 60-70% of reads

emmlemore · April 30, 2025, 11:41am

Hello,

I'm processing environmental 16S samples using V3V4 primers (341F/805R) I am losing a majority of my reads at the denoising step. I know this question has been asked before, and it seemed like the solution was to increase the number of overlapping base pairs, which is what I did, but that led to more reads being lost.

Here is the demux file after primer trimming.

Code for primer trimming:

qiime cutadapt trim-paired 
--i-demultiplexed-sequences /qza/v3v4.qza 
--p-cores 8 
--p-front-f CCTACGGGNGGCWGCAG ###341F 
--p-front-r GACTACHVGGGTATCTAATCC ###805R 
--p-discard-untrimmed 
--p-no-indels 
--o-trimmed-sequences /qza/v3v4_trimmed.qza

v3v4_trimmed.qzv (326.2 KB)

Code for DADA2 including truncation step:

###v3v4 primers: 805-341=464 amplicon; fwd trunc @ 261; rvs trunc @ 220; (261+220)-464 = 17 bp overlap
qiime dada2 denoise-paired 
--i-demultiplexed-seqs /qza/v3v4_trimmed.qza 
--p-trunc-len-f 261 
--p-trunc-len-r 220 
--o-table v3v4_table.qza 
--o-representative-sequences v3v4_rep_seqs.qza 
--o-denoising-stats v3v4_denoising_stats.qza

In my code, I truncated at the base pairs I've chosen because the quality of the reads towards the 3' ends do decrease significantly, but still allowing more than the default 12 bp overlap.

Here is the denoising stats file. I am losing 60-80% of my reads using the parameters in my code.
v3v4_denoising_stats.qzv (1.2 MB)

I tried again using less conservative truncation parameters to allow for more overlap as I thought that was the issue, but results ended up being worse, and I lose even more reads (70-80%).

New DADA2 code:

###v3v4 primers: 805-341=464 amplicon; fwd trunc @ 283; rvs trunc @ 230; (283+230)-464 = 49 bp overlap
qiime dada2 denoise-paired 
--i-demultiplexed-seqs /qza/v3v4_trimmed.qza 
--p-trunc-len-f 283 
--p-trunc-len-r 230 
--o-table v3v4_table2.qza 
--o-representative-sequences v3v4_rep_seqs2.qza 
--o-denoising-stats v3v4_denoising_stats2.qza

And the new denoising stats file:
v3v4_denoising_stats2.qzv (1.2 MB)

Thank you.

timanix · April 30, 2025, 12:17pm

Hello!
First of all, thank you for providing all the details and structuring your post. It saves a lot of time!

Based on your qzv files, your samples collected most of the issues that could decrease the number of output sequences:

filtering (low scores)
chimeras
merging

Considering that you already removed the primers and discarded sequences without them, I can suggest:

using 260 and 230 trunc. values for F and R reads
decreasing minimum overlap to 6
increasing max_ee values for F and R
setting --p-min-fold-parent-over-abundance to, for example, 8 (not more than 16)

and see how it affects your stats.

Best,

emmlemore · May 1, 2025, 11:20am

Hi Timur,

Thank you for your response and your tips. I tried the following code adding in your suggested parameters, and results look slightly better, but still 60-80% of reads are being filtered out.

qiime dada2 denoise-paired 
--i-demultiplexed-seqs /qza/v3v4_trimmed.qza 
--p-trunc-len-f 260 
--p-trunc-len-r 230 
--p-min-overlap 8   ###increased to 8 from suggested 6
--p-max-ee-f 4 
--p-max-ee-r 4 
--p-min-fold-parent-over-abundance 8 
--o-table v3v4_table3.qza 
--o-representative-sequences v3v4_rep_seqs3.qza 
--o-denoising-stats v3v4_denoising_stats3.qza

Here is the demux file.
v3v4_denoising_stats3.qzv (1.2 MB)

This would be my third time running it. I'm wondering if it's just because the reads are not of good quality and if I should proceed with the limited number of reads?

Thank you.

timanix · May 1, 2025, 11:46am

Hi,
Now I see that more reads are getting to the merging, but still a lot of reads are not passing this step.

It is one of the reasons I don't like working with V3-V4 region! It is often like that - the reads are either too short to overlap or of a bad quality.

I would play a little bit more with the truncation parameters and max-ee since, if I remember correctly, V3-V4 region clusters into three groups based on its length, so you are risking to lose the group with the longest region by truncating too low.

So you can try some more combinations, save the output to the folders with different names and then choose one that gives the best output.

Best,

John_Quensen · May 4, 2025, 4:43pm

The variable length of the amplicon region presents a problem that you cannot solve with the DADA2 plugin because there is an important parameter for the mergePairs function that the DADA2 plugin does not expose. By default the maxMismatch parameter (the number of mismatches allowed between the forward and reverse reads in the overlap region) is set to 0 and you cannot change it. For shorter amplicons in the V3/V4 region the overlap can be over 70 bp and allowing no mis-matches is simply too strict. This is why we lose so many reads at the merge step.
I have requested that this parameter be exposed in future editions of the QIIME2 plugin. In the meantime, retention can be improved by using the R version of DADA2. See Processing Psomagen Sequences - John Quensen and Psomagen Exercise Answers - John Quensen for an exercise addressing just this problem.