Dada2 denoise - why do I have so many reads filtered out

Jo_mee · February 26, 2019, 8:05am

Hi everyone,

I am stuck on the quality-filtering part of my analysis and I need some help.
I used DADA2 denoising in my microbiome analysis, and it seems that many reads are filtered out along the way. In my opinion, too many. I have searched the forum and tried to apply what was posted there to my case:
https://forum.qiime2.org/t/dada2-denoise-paired-removing-most-sequences-during-filtering/7159
https://forum.qiime2.org/t/dada2-denoise-paired-result-90-loss-in-reads/619
I worry that this loss affects the microbial community I end up with.

Here is what I did:
Sequencing: 2x250, V3-V4 region, Illumina MiSeq
I followed the Moving Pictures tutorial.

Commands:

# I also ran the same denoising step with --p-trunc-q 10, to check what went wrong.
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs FastQ-demux.qza \
  --p-trunc-q 20 \
  --p-trim-left-f 8 \
  --p-trim-left-r 8 \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 240 \
  --p-n-threads 0 \
  --p-chimera-method consensus \
  --o-representative-sequences rep-seqs-dada2-chill10.qza \
  --o-table table-dada2-chill10.qza \
  --o-denoising-stats stats-dada2_chill10.qza

FastQ-LB18_31-demux.qzv (290.6 KB)

--p-trunc-q 20
stats-dada2_LB18_31.qzv (1.2 MB)
--p-trunc-q 10
stats-dada2_LB18_31-chill10.qzv (1.2 MB)

Are my parameters too strict?
In my opinion the problem may be merging (judging by the --p-trunc-q 10 results), but I don't know how to find the reason for it.
Judging by the --p-trunc-q 20 results, too many reads are filtered out at the very first step.
It is pretty confusing.

Regards,
Joanna

jnesme · February 26, 2019, 1:09pm

The truncation quality (--p-trunc-q) cuts each read at the first base whose quality score is at or below that value. With your settings, each read is cut at the first base of Q20 or lower, and in most reads such a base is very likely to occur before the truncation length you defined. Reads that end up shorter than the truncation length are discarded, so indeed you'll have very few reads remaining after filtering. I hope that was clear.
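
To make the effect concrete, here is a minimal Python sketch of the truncation rule described above. It is a simplified model, not DADA2's actual code: the function names and the toy quality profile are made up for illustration, and it assumes that a read is cut at the first base whose quality is at or below the trunc-q value and is then discarded if it is shorter than the truncation length.

```python
def truncated_length(quals, trunc_q):
    """Length of the read left after cutting at the first base with quality <= trunc_q."""
    for position, q in enumerate(quals):
        if q <= trunc_q:
            return position
    return len(quals)


def survives_filter(quals, trunc_q, trunc_len):
    """True if the quality-truncated read is still at least trunc_len bases long."""
    return truncated_length(quals, trunc_q) >= trunc_len


# Hypothetical 250-base read: good quality throughout, except one Q18 base at index 100.
read_quals = [38] * 100 + [18] + [38] * 149

for trunc_q in (2, 10, 20):
    kept = truncated_length(read_quals, trunc_q)
    status = "kept" if survives_filter(read_quals, trunc_q, trunc_len=240) else "discarded"
    print(f"trunc-q {trunc_q}: {kept} bases before the cut, read {status}")
```

In this toy case a single mediocre base is enough to lose the whole read at trunc-q 20, while the same read passes at 2 or 10.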

Jo_mee · February 26, 2019, 1:28pm

Thanks for the answer.
Does that mean that, for a demux.qzv file like mine, I should not use this flag?
When I changed it to 10 I did indeed keep more reads, yet the merging and chimera-removal steps then removed quite a lot anyway.

amm59063 · February 27, 2019, 3:22pm

Hi @Jo_mee,

I'm having the same problem. My data were generated on a PE300 run. I then tried following the Fecal microbiota transplant (FMT) study tutorial, and after the denoising step I got more reads. I'm not sure that is the correct way, though.

Afaq

thermokarst · February 27, 2019, 4:54pm

Hi @Jo_mee, I think that what @jnesme was suggesting above is that your --p-trunc-q 20 value is responsible for filtering out a significant portion of your reads. The default value for this parameter is 2. What do your results look like when you run with the default?

Jo_mee · March 1, 2019, 9:23am

Dear all,

I did a little research on this topic, and I hope I present it clearly enough. I tested different options based on what others said in similar topics. Here is how it went (sorry for the long post).

dada2, PE reads, default Phred score = 2

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs FastQ-LB18_31-demux.qza \
  --p-trim-left-f 8 \
  --p-trim-left-r 8 \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 240 \
  --p-n-threads 0 \
  --p-chimera-method consensus \
  --o-representative-sequences rep-seqs-dada2-LB18_31-default.qza \
  --o-table table-dada2-LB18_31-default.qza \
  --o-denoising-stats stats-dada2_LB18_31-default.qza

Result: stats-dada2_LB18_31-default.qzv (1.2 MB)

dada2, PE reads, chimera - none

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs FastQ-LB18_31-demux.qza \
  --p-trunc-q 20 \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 0 \
  --p-n-threads 0 \
  --p-chimera-method none \
  --o-representative-sequences rep-seqs-dada2-LB18_31-nochim.qza \
  --o-table table-dada2-LB18_31-nochim.qza \
  --o-denoising-stats stats-dada2_LB18_31-nochim.qza

Result: stats-dada2_LB18_31-nochim.qzv (1.2 MB)

dada2, only forward reads

qiime dada2 denoise-single \
  --i-demultiplexed-seqs FastQ-LB18_31-demux-R1.qza \
  --p-trim-left 8 \
  --p-trunc-len 240 \
  --p-n-threads 0 \
  --p-chimera-method consensus \
  --o-representative-sequences rep-seqs-dada2-LB18_31-R1-noQS.qza \
  --o-table table-dada2-LB18_31-R1-noQS.qza \
  --o-denoising-stats stats-dada2_LB18_31-R1-noQS.qza

Result: stats-dada2_LB18_31-R1-noQS.qzv (1.2 MB)

I also repeated this with other --p-trunc-q values. Chimera filtering stayed 'consensus' in all runs:
--p-trunc-q 20; stats-dada2_LB18_31-R1.qzv (1.2 MB)
--p-trunc-q 10; stats-dada2_LB18_31-R1-10.qzv (1.2 MB)

Additionally, to cross-check all of the above, I used another denoising method, Deblur, on the forward reads only:

qiime deblur denoise-16S \
  --i-demultiplexed-seqs FastQ-LB18_31-demux-R1.qza \
  --p-trim-length 240 \
  --o-representative-sequences rep-seqs-deblur.qza \
  --o-table table-deblur.qza \
  --p-sample-stats \
  --o-stats deblur-stats.qza

deblur-stats.qzv (198.3 KB)

What I noticed is:

As @thermokarst and @jnesme mentioned, lowering --p-trunc-q to the default value may be part of the answer. Yet I always thought that 20 was a kind of threshold I should check first, and I still don't think it is the main reason for such a huge read loss.
In the other runs I am sending here, I think the problem may be the merging of the reads, so I proceeded with an analysis of the forward reads only.
Deblur denoising showed that reads-raw and reads-derep differ a lot. It was said here ('click') that if they are not similar, it may mean I have singletons. Could you confirm this? All in all, the result is similar to the forward-read-only DADA2 denoising with --p-trunc-q 20.

What is a reasonable % of reads to pass filtering (if the sequencing is of good quality, of course)?

I calculated the % of reads that passed denoising for every run. It looks like this:

The top chart shows the analysis of the forward reads (R1) with different parameters; the bottom chart shows the forward (R1) and reverse (R2) reads together.

thermokarst · March 1, 2019, 8:18pm

If you are truncating your reads at the first instance of Q20, your reads are likely much shorter than they need to be for merging. You are effectively throwing away the nucleotides that allow merging to happen. Any reads that aren't merged are lost.

Don't forget, DADA2 attempts to correct errors, which is probably why the default trunc-q is so low.

Jo_mee · March 4, 2019, 8:52am

I see, but I still have doubts. Please check the results of the default setup (Q2):

sample-id	input	filtered	denoised	merged	non-chimeric	% left
#q2:types	numeric	numeric	numeric	numeric	numeric
LB17-16-001	145783	123434	123434	68005	45408	31.15%
LB17-16-002	70462	60081	60081	31014	22621	32.10%
LB17-16-003	119707	96946	96946	73652	30813	25.74%
LB17-16-004	130350	112151	112151	61758	42091	32.29%
LB17-16-005	111168	95150	95150	50107	35143	31.61%
LB17-16-006	122771	99810	99810	68519	38149	31.07%
LB17-16-007	119037	100004	100004	60282	34820	29.25%
LB17-16-008	105227	83766	83766	51094	24924	23.69%
LB17-16-009	203605	171129	171129	85039	50089	24.60%
LB17-16-010	143034	121132	121132	105902	44471	31.09%

Is it a lot?
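
One way to see where the reads go is to compute retention step by step rather than only overall. The sketch below is a small illustration in Python, not part of the original analysis: it uses three rows copied from the table above, and the helper names are made up for this example.

```python
# Three rows copied from the table above: (input, filtered, denoised, merged, non-chimeric)
stats = {
    "LB17-16-001": (145783, 123434, 123434, 68005, 45408),
    "LB17-16-003": (119707, 96946, 96946, 73652, 30813),
    "LB17-16-010": (143034, 121132, 121132, 105902, 44471),
}
steps = ("filtered", "denoised", "merged", "non-chimeric")


def step_retention(counts):
    """Percent of reads kept by each step, relative to the step before it."""
    return [100 * after / before for before, after in zip(counts, counts[1:])]


def overall_retention(counts):
    """Percent of input reads remaining after the last step."""
    return 100 * counts[-1] / counts[0]


for sample, counts in stats.items():
    per_step = ", ".join(
        f"{name} {pct:.1f}%" for name, pct in zip(steps, step_retention(counts))
    )
    print(f"{sample}: {per_step}; overall {overall_retention(counts):.2f}%")
```

For these three rows, the filtering step keeps over 80% of the reads, so most of the loss comes from the merging and chimera-removal steps.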

Mehrbod_Estaki · March 4, 2019, 9:36pm

Hi @Jo_mee,
If I may qiime in here for a moment.
It's great to see users diving so deep into this stuff because it really does highlight how much expertise goes into these analyses and how specific each analysis can be. And thanks for sharing your results and searching around on the forum as well.
I'll start with the conclusion, which is that what you are seeing is perfectly normal and nothing unexpected is happening.
The minimum Phred score of 20 that you mentioned was indeed an unofficial standard in the field when we were working with OTU clustering methods. Before denoising methods such as DADA2/Deblur/Unoise came out, that was necessary as one means of ensuring we weren't introducing too much error by allowing low-quality reads. These newer methods, however, attempt to correct those base calls and so do not rely on a minimum Phred-score threshold, within reason. This is why, for example, DADA2 by default has a truncQ of 2, and I've personally never needed to change that. In fact, if you increase that number to 10-20, as you have done in some of your runs, it will discard too many reads when we could just be correcting them and using them. In the case of DADA2, where an error model is built first, it may even prevent the proper building of that model. So overall, unless you have a very specific reason to do so, I would recommend just leaving it at the default.
Moving on to the main cause of your losses. You mentioned that you have 2x250 bp V3-V4 reads. With the most common V3-V4 primers you get a ~460 bp amplicon, but a 2x250 bp run gives you at most 500 bp of combined read length, which leaves only about 40 bp of overlap. DADA2 requires a minimum 20bp overlap for proper merging, otherwise it will toss any reads (both forward and reverse) that it can't merge. Also take into account the natural length variation of this amplicon, which means reads from some true taxa overlap by even less, and the fact that we need to truncate the poor-quality tails at the 3' end of our reads (where merging occurs). All of these play a part in failed merging, which is what you are seeing here. This is actually very common even in 2x300 bp V3-V4 runs if the 3' tails are poor in quality, or, as in your case, in 2x250 runs. With my own colleagues I tend to advise against 2x250 runs for the V3-V4 region, since most of the time you end up not using the reverse reads anyway. Which brings me to my final thought.
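
The overlap arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration under stated assumptions: the ~460 bp amplicon and the 20 bp minimum are the figures quoted in this post (not checked against DADA2's defaults), the 470 bp variant is hypothetical, truncation lengths are counted from the start of each raw read so that 5' trimming does not change the overlap, and the function name is made up.

```python
def expected_overlap(trunc_len_f, trunc_len_r, amplicon_len):
    """Bases shared by the forward and reverse reads after truncation (negative = gap)."""
    return trunc_len_f + trunc_len_r - amplicon_len


MIN_OVERLAP = 20  # figure quoted in this thread; treated here as an assumption

scenarios = {
    "full 2x250 reads": (250, 250),
    "truncated at 240/240": (240, 240),
    "truncated at 230/230": (230, 230),
}

for label, (len_f, len_r) in scenarios.items():
    # 460 bp is the thread's figure; 470 bp is a hypothetical longer variant.
    for amplicon_len in (460, 470):
        overlap = expected_overlap(len_f, len_r, amplicon_len)
        verdict = "can merge" if overlap >= MIN_OVERLAP else "too short to merge"
        print(f"{label}, {amplicon_len} bp amplicon: {overlap} bp overlap, {verdict}")
```

Under these assumptions, truncating both reads at 240 leaves exactly 20 bp of overlap for a 460 bp amplicon, so any longer amplicon or any further truncation falls below the threshold.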
Your paired-end run with the default settings actually comes up with a reasonable number of reads, with your lowest sample having over 3000 reads. But if you really need more reads for your analysis, I would suggest you stick with just the forward reads. You retain a lot more reads, but you lose a little resolution in the trade. Depending on your study, this may not be a big issue at all. In fact, given that your forward reads are in good shape, you can retain most of them at, say, 240 bp, and this isn't a huge loss in resolution compared to 450 bp (see Fig. 1 of Wang et al. 2007).
Hope this clarifies some of your questions.
