Filtering, truncating and trimming reads in dada2


I've only recently started using QIIME 2 and would like a reality check on the denoising step in DADA2. I have 96 samples in total, and overall I think the read quality looks pretty good. They are 300 bp paired-end reads and we amplified the V3-V4 region. Please see below.

demux-paired.qzv (323.4 KB)

I am just playing around with the truncation and trimming options in dada2 and am trying to figure out the optimum. I've read a few other threads on the forum and based on these have played around with truncation lengths and trimming and have viewed (in a very non-scientific way) the impact these have on the number of reads that pass the initial filter and are subsequently labelled as Features.

I initially set a truncation length of 270 for both forward and reverse reads

qiime dada2 denoise-paired --i-demultiplexed-seqs demux-paired.qza --p-trunc-len-f 270 --p-trunc-len-r 270 --p-n-threads 16 --output-dir dada2out_270_270

The resultant files are linked below

stats-dada2.qzv (1.2 MB)
table-dada2.qzv (546.0 KB)

I have tried alternative truncation lengths, but in general I think the best results come when I don't truncate at all. I had also read some posts recommending that a position be retained as long as its median Phred score is >30.
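As a rough illustration of that rule of thumb, the sketch below scans per-position median Phred scores from the 5' end and reports how many leading positions stay above a threshold. The function name `suggest_trunc_len` and the `toy_medians` profile are made up for this example; they are not values read from demux-paired.qzv, and real medians would have to be taken from the interactive quality plot.

```python
def suggest_trunc_len(median_quals, threshold=30):
    """Return how many leading positions have a median Phred score > threshold.

    Scanning stops at the first position whose median is at or below the
    threshold, so a single early dip ends the scan (a deliberate simplification).
    """
    for position, q in enumerate(median_quals):
        if q <= threshold:
            return position
    return len(median_quals)


# Hypothetical per-position medians for a 300 nt read (illustration only).
toy_medians = [36] * 200 + [32] * 50 + [28] * 30 + [20] * 20

print(suggest_trunc_len(toy_medians))
```

A number obtained this way is only a candidate for `--p-trunc-len-f` or `--p-trunc-len-r`; it says nothing about whether the truncated reads still overlap enough to merge.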

Results below

qiime dada2 denoise-paired --i-demultiplexed-seqs trimmed_exact.qza --p-trunc-len-r 0 --p-trunc-len-f 0 --output-dir dada2out

stats-dada2.qzv (1.2 MB)
table-dada2.qzv (560.1 KB)

I know this is an incredibly general and subjective question, but based on the number of reads that passed and the resulting feature table, would I be right in assuming that I can proceed with the denoising run that does not truncate reads? I have also played around with overlap values and am not seeing major differences.

I guess the thing that has surprised me is that we also have ITS data for these samples, and the number of reads that pass the DADA2 filter and end up as features is much higher than what we observe for the 16S data (see below). We used ITSxpress to trim reads before passing them into DADA2. I guess this is just a difference in the length of the amplicons?

ITS_demux-paired.qzv (327.9 KB)
ITS_stats-dada2.qzv (1.2 MB)
ITS_table-dada2.qzv (656.0 KB)

Thanks in advance

Hello David,

Welcome to the forums! :qiime2:

This is a great post! You are on the right track and asking all the right questions.

I've pulled out your two DADA2 runs so I can compare them side by side.
Both tables are sorted by lowest percent of input merged.

You are losing most reads, like 90% (!!) in the quality filter step.

Here, most reads pass the filter, but fewer are able to join.

Ideally, there would be a happy medium, in which most reads could pass filter and still join. But because the V3-V4 region is so long, I think this may be the best we can do.

Having 3k reads per sample is still pretty good!

Yes. While it depends on the exact primers, V3-V4 is so long that it often maxes out what the Illumina platform can sequence while still being able to pair.
And DADA2 needs reads to pair. :person_shrugging:

These results look good! I'm glad Qiime2 is working for you.

Let us know if you have more questions!

Hi Colin

Thanks for the quick reply, much appreciated. I just wanted to make sure I wasn't doing something unusual before proceeding. I probably will proceed at this point, but I am kind of surprised that such a high proportion of the reads can't be merged, as I assumed the V3-V4 region to be ~460 nucleotides. After trimming I think the majority of our reads are 280 nt, so I calculated a rough overlap of 280 (F) + 280 (R) - 460 (amplicon) = ~100 nucleotides.

I think what is confusing the issue is that the sequencing provider supplied Tags along with Raw and Clean reads. The number of Tags per sample is very high (60K+) and their average length is ~416 nt. They stated that if paired-end reads overlap with each other, a consensus sequence is generated using Fast Length Adjustment of SHort reads (FLASH v1.2.11), with a minimum overlap length of 15 bp and a mismatch ratio in the overlapped region of <= 0.1.
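To make those two merge criteria concrete, here is a toy sketch of overlap-based merging that applies a minimum overlap of 15 bp and a maximum mismatch ratio of 0.1. It is not FLASH's algorithm: it simply accepts the longest overlap that passes the mismatch test and keeps the forward read's bases in the overlap, whereas the real tool is more careful about choosing among candidate overlaps and resolving mismatched bases. The function names and the synthetic amplicon are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(fwd, rev, min_overlap=15, max_mismatch_ratio=0.1):
    """Toy merge: try the longest overlap first, accept the first that passes.

    Returns the merged sequence, or None if no overlap of at least
    `min_overlap` bases has a mismatch ratio <= `max_mismatch_ratio`.
    """
    rc = reverse_complement(rev)
    for olen in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        tail, head = fwd[-olen:], rc[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches / olen <= max_mismatch_ratio:
            # Keep the forward read in the overlap (no consensus calling).
            return fwd + rc[olen:]
    return None


# Synthetic 60 nt "amplicon" read as two 40 nt reads, i.e. a 20 nt overlap.
amplicon = "A" * 20 + "GT" * 10 + "C" * 20
fwd_read = amplicon[:40]
rev_read = reverse_complement(amplicon[20:])

print(merge_pair(fwd_read, rev_read) == amplicon)
```

With thresholds this permissive, most overlapping pairs merge, which would be consistent with the high Tag counts the provider reports.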

I was thinking of importing these tags as SE reads and then running them through DADA2 to compare the number of features at the end. Not sure if this is a sensible thing to do?


Good morning David,

Or 270+270-460 = 80 overlap after trimming. Still plenty of overlap, if that 460 amplicon length is correct...

This implies that the 460 amplicon length is an overestimate, and maybe the real length is more like ~420, meaning we have too much overlap and more truncation is needed.

Try this:
--p-trunc-len-f 260 --p-trunc-len-r 220

(I'm aiming for 20 bp of overlap, and removing the low quality end of R2)
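The overlap arithmetic used in this thread can be sketched as below, for both the assumed 460 nt amplicon and the ~420 nt alternative. `expected_overlap` is a name made up for this example, and the `MIN_OVERLAP` value of 12 is an assumption about the default minimum overlap DADA2 needs to merge a pair (exposed as `--p-min-overlap` in recent q2-dada2 releases); check it against your installed version.

```python
MIN_OVERLAP = 12  # assumption: default minimum overlap required for merging


def expected_overlap(trunc_f, trunc_r, amplicon_len):
    """Bases shared by the two reads once both have been truncated."""
    return trunc_f + trunc_r - amplicon_len


for amplicon_len in (460, 420):
    for trunc_f, trunc_r in [(270, 270), (260, 220)]:
        overlap = expected_overlap(trunc_f, trunc_r, amplicon_len)
        status = "can merge" if overlap >= MIN_OVERLAP else "too short to merge"
        print(f"amplicon {amplicon_len}, trunc {trunc_f}/{trunc_r}: "
              f"overlap {overlap} nt ({status})")
```

Because V3-V4 amplicons vary in length between taxa, a truncation pair that leaves only a small margin for the longest amplicons may drop those reads at the merging step.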

One last thing:

DADA2 processes single-end and paired-end reads differently, so this is not advised. If we can get it working from raw reads, we should!

Thanks for this suggestion. Have to say this one has me scratching my head. I assumed you could never have too much overlap :smile: . In an ideal world, if one has paired-end reads with every position receiving a high Phred score of >35 (for example), I'm guessing this wouldn't be an issue?

I followed your suggestion and tried an array of different truncation lengths (I won't bore you with all of them). As a rough rule of thumb, I kept an eye on the samples with the largest and smallest number of features (CF12 and K17, respectively) as well as the overall number of features merged across the 96 samples.

With no truncation the numbers were CF12=40317, K17=3927, overall 2,303,347
stats-dada2_0_0_t10.qzv (1.2 MB)

Following your suggestion of --p-trunc-len-f 260 --p-trunc-len-r 220
CF12=35206, K17=12105, overall 1,931,093, so a large increase in the K17 sample
stats-dada2_260_200.qzv (1.2 MB)

I played around a little with --p-trunc-len-f 270 --p-trunc-len-r 180
CF12=39023, K17=13926, overall 2,272,248
stats-dada2_270_180.qzv (1.2 MB)

So the take-home message is that truncating the reverse reads had the biggest impact. Thanks again for the suggestion. I'm happy to proceed with these denoised reads at this point, as I think they have improved on my initial merging strategy.
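For a quick side-by-side view, the counts quoted above can be expressed as percent change relative to the untruncated run. The numbers are copied from this post; the `runs` dictionary and the `percent_change` helper are just illustrative names.

```python
# Counts as reported above for each truncation setting (forward/reverse).
runs = {
    "no truncation": {"CF12": 40317, "K17": 3927, "overall": 2303347},
    "260/220": {"CF12": 35206, "K17": 12105, "overall": 1931093},
    "270/180": {"CF12": 39023, "K17": 13926, "overall": 2272248},
}


def percent_change(new, old):
    """Signed percent change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


baseline = runs["no truncation"]
for name, counts in runs.items():
    if name == "no truncation":
        continue
    changes = ", ".join(
        f"{key} {percent_change(counts[key], baseline[key]):+.1f}%"
        for key in ("CF12", "K17", "overall")
    )
    print(f"{name}: {changes}")
```

Read this way, both truncated runs roughly triple the count for K17, and the 270/180 run does so while giving up only a small fraction of the overall total.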


Ah good question!

For the DADA2 method, truncating is done first, then reads are removed if their cumulative expected error is too high. This means that shorter reads have fewer expected errors, so more will pass filter.
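A minimal sketch of why that happens: the expected number of errors in a read is the sum of the per-base error probabilities, 10^(-Q/10), so every extra low-quality base adds to the total. The quality profile below is invented for illustration, and `MAX_EE = 2.0` is an assumption based on the usual default for `--p-max-ee-f` and `--p-max-ee-r`; check the value your q2-dada2 version reports.

```python
def expected_errors(quals):
    """Sum of per-base error probabilities for a list of Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


MAX_EE = 2.0  # assumption: default maximum expected errors per read

# Hypothetical 300 nt read: 200 bases at Q35 followed by a 100-base Q15 tail.
toy_read = [35] * 200 + [15] * 100

for trunc_len in (300, 260, 220):
    ee = expected_errors(toy_read[:trunc_len])
    verdict = "passes" if ee <= MAX_EE else "is discarded"
    print(f"truncated at {trunc_len}: expected errors = {ee:.2f}, read {verdict}")
```

The same read that is thrown away at full length survives once the noisy tail is cut off, which is why shorter truncation lengths let more reads through the filter, up to the point where the pairs no longer overlap.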

This is counterintuitive. In other programs, more overlap is more good. But not for DADA2.

Great! This is expected.

Feel free to open another thread if you have more Qiime2 questions :qiime2:
