Best way to denoise data


I have trouble deciding how to do the denoising step with DADA2 when I have a quality plot like this (attached).
Previously I have had a clear idea of what to do, because the quality was low at the 5' and 3' ends, but now I can see a quality drop in the middle. What would be the best way to go from here? I know that denoise-paired has the --p-trunc-q option — should I trim and truncate the reads as usual and then maybe add --p-trunc-q as well?
I am using QIIME 2 version 2019.7, installed on my workplace server.
This is the command I ran, but I think I should do something about the middle part as well, where the quality drops quite a bit:
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs toitumisuuring-influencerid-paired.qza \
  --p-n-threads 20 \
  --p-trunc-len-f 252 \
  --p-trunc-len-r 219 \
  --p-trim-left-f 15 \
  --p-trim-left-r 26 \
  --p-chimera-method consensus \
  --o-representative-sequences rep-seqs-toitumisuuring-influencerid-paired.qza \
  --o-table table-toitumisuuring-influencerid-paired.qza \
  --o-denoising-stats stats-toitumisuuring-influencerid-paired.qza \
  --verbose

I would like to be able to use --p-trunc-len-r 240 instead of 219 so that I wouldn't lose that data, but the quality is so wonky.

Hi @kreetelyll,
What region is this and what primer set are you using? We're particularly interested in how much overlap region there is.
Can you also post the stats results visualization of your dada2 run so we know how this run actually performed?
It’s hard to make recommendations without this information.

For now I would stay away from playing around with the q-score filtering; instead, try trimming the first 30 or so bases from the 5' end and see if that helps.

So what happens if you do run those parameters?


It is the V3-V4 region. I believe the primers used were 515F–806R.
I added the stats visualization.

I haven't tried it out.

I know that two different kits were used for DNA extraction. Can that play a big role, and should I maybe run two separate denoising steps depending on which extraction method was used?

stats.qzv (1.2 MB)

Hi @kreetelyll,
Thanks for the updated info.
From the provenance of the stats.qzv I can see that you ran the following parameters for DADA2:


Note that this is different from what you initially mentioned above. I'm not sure whether this is intentional, but I thought I'd point it out.
The problem so far in your stats results is that your reads are unable to merge. This is because you are truncating too much from the 3' end. With a long region like V3-V4, plus the fact that you only have 2 x 250 bp reads, that doesn't leave you much room for truncating. You have two options: either a) re-run with truncation parameters relaxed to the point where your reads still merge (search the forum — there are lots of threads discussing how to calculate the overlap region and set the DADA2 trimming parameters), or b) discard the reverse reads altogether and just use the forward reads.
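As a rough back-of-the-envelope sketch of that overlap calculation (my own helper, not a QIIME 2 command — and the ~460 nt V3-V4 amplicon length is an assumption; adjust it for your actual primer set), you can estimate how many bases of overlap are left after trimming and truncation. DADA2's paired-end merging needs a minimum overlap (12 nt by default), so a small or negative number here predicts merge failure:

```python
# Hedged sketch: estimate the overlap available to DADA2 when merging
# paired-end reads. amplicon_len ~460 nt is an assumed V3-V4 length.
def expected_overlap(trunc_f, trim_f, trunc_r, trim_r, amplicon_len):
    """Bases of overlap left after trim/trunc (negative = reads cannot merge)."""
    kept_f = trunc_f - trim_f   # forward read length after trimming/truncation
    kept_r = trunc_r - trim_r   # reverse read length after trimming/truncation
    return kept_f + kept_r - amplicon_len

# The parameters from the original command, with an assumed ~460 nt amplicon:
print(expected_overlap(252, 15, 219, 26, 460))   # -30: no overlap, merging fails
```

With these numbers the reads fall about 30 nt short of even touching, which is consistent with the failed merging seen in the stats.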
I wouldn't worry about the middle region of questionable quality; those reads seem to have passed the initial DADA2 filtering step anyway (see the filtered column in stats.qzv).

Hope this helps!


Sorry, yes, I sent the wrong file. This one is correct: stats.qzv (1.2 MB). I used:
--p-trunc-len-f 0
--p-trunc-len-r 219
--p-trim-left-f 15
--p-trim-left-r 26
What do you think — is it okay to continue with these data, or should I change the settings / use only the forward reads? I am not completely sure how to read the stats file, or what the most important information is to get out of it when deciding my next steps. I would appreciate a pro's opinion.

Hi @kreetelyll,

The stats file shows, chronologically, what happens at each step of DADA2:
input: how many reads you started with
filtered: # of reads retained after initial filtering, trimming, etc.
denoised: # of reads that were successfully denoised and retained
merged: # of reads that were successfully merged after denoising
non-chimeric: # of reads retained after chimera removal <- this is the final count you should see in your feature-table summarize results.
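To make those columns easier to interpret, here is a small sketch (with made-up illustrative numbers, not values from the actual stats.qzv) that expresses each step as a fraction of the previous one — a large drop at "merged" is the signature of reads failing to overlap:

```python
# Hedged sketch: per-step read retention from one row of a DADA2 stats table.
# The example counts below are illustrative, not from the real stats.qzv.
def step_retention(stats, steps):
    """Return {step: fraction of reads retained relative to the previous step}."""
    return {cur: stats[cur] / stats[prev]
            for prev, cur in zip(steps, steps[1:])}

steps = ["input", "filtered", "denoised", "merged", "non-chimeric"]
example = {"input": 50000, "filtered": 42000, "denoised": 41000,
           "merged": 25000, "non-chimeric": 24000}

for step, frac in step_retention(example, steps).items():
    print(f"{step}: {frac:.0%} of previous step")
# Here "merged" retains only ~61% of denoised reads — the kind of
# merging loss being discussed in this thread.
```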

So technically you have enough reads to move forward with your analyses. However, you are losing more reads during the merging step than we usually see. Given that this is the long V3-V4 region and you are only dealing with 2 x 250 nt reads, I would be a bit concerned that the reads failing to merge represent naturally longer taxa. That would mean you could be introducing a real biological bias into your data by excluding those longer taxa. If this were me, I would probably skip merging and focus on a single read direction (the forward or the reverse reads) to avoid introducing that merging-derived bias.


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.