Dada2 denoise - why do I have so many reads filtered out

Hi everyone,

I am stuck on the quality-filtering part of my analysis and I need some help.
I used DADA2 denoising in my microbiome analysis, and it seems that many reads are filtered out along the way; too many, in my opinion. I have searched the forum and tried to apply what was posted in these threads to my case:
https://forum.qiime2.org/t/dada2-denoise-paired-removing-most-sequences-during-filtering/7159
https://forum.qiime2.org/t/dada2-denoise-paired-result-90-loss-in-reads/619
I worry that this loss affects the inferred microbial community.

Here is what I did:
Sequencing: 2x250, V3-V4 region, Illumina MiSeq
I followed the Moving Pictures tutorial.

Commands:

# I also ran the same denoising with --p-trunc-q 10, to check what went wrong
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs FastQ-demux.qza \
  --p-trunc-q 20 \
  --p-trim-left-f 8 \
  --p-trim-left-r 8 \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 240 \
  --p-n-threads 0 \
  --p-chimera-method consensus \
  --o-representative-sequences rep-seqs-dada2-chill10.qza \
  --o-table table-dada2-chill10.qza \
  --o-denoising-stats stats-dada2_chill10.qza

FastQ-LB18_31-demux.qzv (290.6 KB)

--p-trunc-q 20:
stats-dada2_LB18_31.qzv (1.2 MB)
--p-trunc-q 10:
stats-dada2_LB18_31-chill10.qzv (1.2 MB)

Are my parameters too strict?
In my opinion the problem may be merging (based on the --p-trunc-q 10 results), but I don’t know how to find the reason for it.
Based on the --p-trunc-q 20 results, too many reads are filtered out at the very first step.
It is pretty confusing.

Regards,
Joanna


The quality value set for truncation cuts each read at the first base whose quality is at or below that value. With your settings, each read is cut at the first base of Q20 or lower, which in most reads is very likely to occur before the truncation length you defined. Reads that end up shorter than the truncation length are discarded, so indeed you’ll have very few reads remaining after filtering. I hope that was clear.
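To make the interaction between the two settings concrete, here is a minimal Python sketch of the behaviour described above. It is a simplified model, not DADA2's actual code: the function name `filter_read` and the example read are made up for illustration, and trim-left and expected-error filtering are ignored.

```python
def filter_read(quals, trunc_q=2, trunc_len=240):
    """Return the read length kept after filtering, or None if discarded.

    Simplified model (assumption: trim-left and expected-error
    filtering are ignored):
      1. cut the read at the first base with quality <= trunc_q
      2. discard the read if it is now shorter than trunc_len
      3. otherwise truncate it to trunc_len (0 means "do not truncate")
    """
    kept = len(quals)
    for position, quality in enumerate(quals):
        if quality <= trunc_q:
            kept = position  # this base and everything after it is removed
            break
    if trunc_len == 0:
        return kept
    if kept < trunc_len:
        return None
    return trunc_len


# Hypothetical 250-base read: good quality except a single Q18 at base 181.
read = [36] * 180 + [18] + [30] * 69

print(filter_read(read, trunc_q=20, trunc_len=240))  # discarded (None)
print(filter_read(read, trunc_q=2, trunc_len=240))   # kept at 240 bases
print(filter_read(read, trunc_q=20, trunc_len=0))    # kept, but only 180 bases
```

In this toy case a single base below Q20 is enough to lose the whole read when a truncation length of 240 is also requested, while with the default value of 2 the same read survives.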

Thanks for the answer.
Does that mean that, for a demux.qzv file like mine, I should not use this flag?
When I changed it to 10, I did get more reads left, yet the merging and chimera-removal steps then removed quite a lot anyway.

Hi @Jo_mee,

I’m having the same problem. My data were generated with PE300. I then tried following the Fecal microbiota transplant (FMT) study tutorial, and after the denoising step I got more reads. I’m not sure whether that is the correct approach.

Afaq

Hi @Jo_mee, I think that what @jnesme was suggesting above is that your --p-trunc-q 20 value is responsible for filtering out a significant portion of your reads. The default value for this parameter is 2. What do your results look like when you run with the default?


Dear all,

I did a little research on this topic, and I hope I present it clearly enough. I tested different options based on what others said in similar topics. Here is how it went (sorry for the long post):

  1. DADA2, PE reads, default --p-trunc-q (2)

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs FastQ-LB18_31-demux.qza \
  --p-trim-left-f 8 \
  --p-trim-left-r 8 \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 240 \
  --p-n-threads 0 \
  --p-chimera-method consensus \
  --o-representative-sequences rep-seqs-dada2-LB18_31-default.qza \
  --o-table table-dada2-LB18_31-default.qza \
  --o-denoising-stats stats-dada2_LB18_31-default.qza

Result: stats-dada2_LB18_31-default.qzv (1.2 MB)

  2. DADA2, PE reads, chimera method none

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs FastQ-LB18_31-demux.qza \
  --p-trunc-q 20 \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 0 \
  --p-n-threads 0 \
  --p-chimera-method none \
  --o-representative-sequences rep-seqs-dada2-LB18_31-nochim.qza \
  --o-table table-dada2-LB18_31-nochim.qza \
  --o-denoising-stats stats-dada2_LB18_31-nochim.qza

Result: stats-dada2_LB18_31-nochim.qzv (1.2 MB)

  3. DADA2, forward reads only

qiime dada2 denoise-single \
  --i-demultiplexed-seqs FastQ-LB18_31-demux-R1.qza \
  --p-trim-left 8 \
  --p-trunc-len 240 \
  --p-n-threads 0 \
  --p-chimera-method consensus \
  --o-representative-sequences rep-seqs-dada2-LB18_31-R1-noQS.qza \
  --o-table table-dada2-LB18_31-R1-noQS.qza \
  --o-denoising-stats stats-dada2_LB18_31-R1-noQS.qza

Result: stats-dada2_LB18_31-R1-noQS.qzv (1.2 MB)

I also ran the forward reads with other --p-trunc-q values. Chimera filtering stayed ‘consensus’ in all of them:
--p-trunc-q 20: stats-dada2_LB18_31-R1.qzv (1.2 MB)
--p-trunc-q 10: stats-dada2_LB18_31-R1-10.qzv (1.2 MB)

Additionally, to cross-check all of the above, I used another denoising method, Deblur, on the forward reads only:

qiime deblur denoise-16S \
  --i-demultiplexed-seqs FastQ-LB18_31-demux-R1.qza \
  --p-trim-length 240 \
  --o-representative-sequences rep-seqs-deblur.qza \
  --o-table table-deblur.qza \
  --p-sample-stats \
  --o-stats deblur-stats.qza

deblur-stats.qzv (198.3 KB)

What I noticed is:

  • As @thermokarst and @jnesme mentioned, lowering --p-trunc-q to its default value may help. However, I always thought that 20 was a kind of threshold I should check first, and I still don’t think it is the main reason for such a huge read loss.
  • From the other runs I posted here, I think the problem may be the merging of the reads, so I continued the analysis with the forward reads only.
  • Deblur denoising showed that reads-raw and reads-derep differ a lot. Here (‘click’) it was said that if they are not similar, it may mean I have singletons. Could you confirm this? All in all, the result is similar to the forward-read-only DADA2 denoising with --p-trunc-q 20.

What is a reasonable percentage of reads to pass filtering (if the sequencing is of good quality, of course)?

I calculated the percentage of reads that passed denoising for each run. It looks like this:


The top chart shows the forward reads (R1) under different parameters; the bottom chart shows the forward (R1) and reverse (R2) reads together.

If you are truncating your reads at the first instance of Q20, your reads are likely much shorter than they need to be for merging. You are effectively throwing away the nucleotides that allow merging to happen. Any reads that aren’t merged are lost.

Don’t forget, DADA2 attempts to correct errors, which is probably why the default trunc-q is so low.

I see, but I still have doubts. Please check the results of the default setup (--p-trunc-q 2):

sample-id    input    filtered  denoised  merged   non-chimeric  % left
#q2:types    numeric  numeric   numeric   numeric  numeric
LB17-16-001  145783   123434    123434    68005    45408         31.15%
LB17-16-002  70462    60081     60081     31014    22621         32.10%
LB17-16-003  119707   96946     96946     73652    30813         25.74%
LB17-16-004  130350   112151    112151    61758    42091         32.29%
LB17-16-005  111168   95150     95150     50107    35143         31.61%
LB17-16-006  122771   99810     99810     68519    38149         31.07%
LB17-16-007  119037   100004    100004    60282    34820         29.25%
LB17-16-008  105227   83766     83766     51094    24924         23.69%
LB17-16-009  203605   171129    171129    85039    50089         24.60%
LB17-16-010  143034   121132    121132    105902   44471         31.09%

Is it a lot?
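One way to see where the reads go is to look at the fraction kept at each step rather than only the final percentage. The short Python sketch below does that for the counts in the table above; the names `STATS`, `STEPS`, `step_retention` and `overall_retention` are made up for this illustration and are not part of QIIME 2.

```python
# Read counts copied from the table above:
# sample-id -> (input, filtered, denoised, merged, non-chimeric)
STATS = {
    "LB17-16-001": (145783, 123434, 123434, 68005, 45408),
    "LB17-16-002": (70462, 60081, 60081, 31014, 22621),
    "LB17-16-003": (119707, 96946, 96946, 73652, 30813),
    "LB17-16-004": (130350, 112151, 112151, 61758, 42091),
    "LB17-16-005": (111168, 95150, 95150, 50107, 35143),
    "LB17-16-006": (122771, 99810, 99810, 68519, 38149),
    "LB17-16-007": (119037, 100004, 100004, 60282, 34820),
    "LB17-16-008": (105227, 83766, 83766, 51094, 24924),
    "LB17-16-009": (203605, 171129, 171129, 85039, 50089),
    "LB17-16-010": (143034, 121132, 121132, 105902, 44471),
}
STEPS = ("input", "filtered", "denoised", "merged", "non-chimeric")


def step_retention(counts):
    """Fraction of reads kept at each step, relative to the previous step."""
    return {
        STEPS[i]: counts[i] / counts[i - 1]
        for i in range(1, len(counts))
    }


def overall_retention(counts):
    """Fraction of input reads that survive all steps."""
    return counts[-1] / counts[0]


for sample, counts in STATS.items():
    per_step = step_retention(counts)
    summary = ", ".join(f"{step} {frac:.1%}" for step, frac in per_step.items())
    print(f"{sample}: {summary}, overall {overall_retention(counts):.1%}")
```

Read this way, the quality-filtering step keeps most reads in these samples, and the larger drops happen at merging and chimera removal.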

Hi @Jo_mee,
If I may qiime in here for a moment.
It’s great to see users diving so deep into this stuff because it really does highlight how much expertise goes into these analyses and how specific each analysis can be. And thanks for sharing your results and searching around on the forum as well.
I’ll start with the conclusion: what you are seeing is perfectly normal and nothing unexpected is happening.
The minimum Phred score of 20 that you mentioned was indeed an unofficial standard in the field when we were working with OTU clustering methods. Before denoising methods such as DADA2/Deblur/UNOISE came out, that was necessary as one means of ensuring we weren’t introducing too much error by allowing low-quality reads. These newer methods, however, attempt to correct those base calls and so do not rely on a minimum Phred-score threshold, within reason. This is why, for example, DADA2 has a default truncQ of 2, and I’ve personally never needed to change that. In fact, if you increase that number to 10-20, as you have done in some of your runs, it will discard too many reads that could simply be corrected and used. In the case of DADA2, where an error model is built first, it may even prevent that model from being built properly. So overall, unless you have a very specific reason to do so, I would recommend just leaving it at the default.
Moving on to the main cause of your losses. You mentioned that you have 2x250 bp V3-V4 reads. With the most common V3-V4 primers you will have a ~460 bp amplicon, but a 2x250 bp run gives you at most 500 bp of combined read length, which leaves only ~40 bp of overlap before any truncation. DADA2 needs a minimum overlap to merge reads (plan for roughly 20 bp; the default minimum in DADA2’s mergePairs is 12 bp), and it tosses any read pair (both forward and reverse) that it can’t merge. Take into consideration the natural length variation of this amplicon, which means reads from some true taxa will overlap less than the average, and the fact that we need to truncate the poor-quality tails of our reads at the 3’ end (where merging occurs). All of these play a part in failed merging, which is what you are seeing here. This is actually very common even in 2x300 bp V3-V4 runs if the 3’ tails are poor in quality, or, as in your case, in 2x250 runs. With my own colleagues I tend to advise against 2x250 runs for V3-V4 regions, since most of the time you end up not using the reverse reads anyway. Which brings me to my final thought.
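The overlap arithmetic above can be written out as a small Python sketch. It assumes that the reads start at the two ends of the amplicon and that both reads keep their full truncated length; the function names and the example lengths other than 460, 250 and 240 are hypothetical, and the threshold uses the roughly 20 bp figure mentioned in this thread.

```python
def expected_overlap(amplicon_len, fwd_len, rev_len):
    """Bases shared by the forward and reverse read (negative means a gap)."""
    return fwd_len + rev_len - amplicon_len


def can_merge(amplicon_len, fwd_len, rev_len, min_overlap=20):
    """True if the pair overlaps by at least min_overlap bases."""
    return expected_overlap(amplicon_len, fwd_len, rev_len) >= min_overlap


AMPLICON = 460  # approximate V3-V4 amplicon length quoted above

print(expected_overlap(AMPLICON, 250, 250))  # 40: untruncated 2x250 reads
print(expected_overlap(AMPLICON, 240, 240))  # 20: truncated at 240/240
print(can_merge(470, 240, 240))              # False: a 10 bp longer amplicon
print(can_merge(AMPLICON, 180, 240))         # False: one read cut short at 180
```

With truncation at 240/240 there is essentially no slack left, so any amplicon longer than about 460 bp, or any read shortened further by quality truncation, drops below the threshold and the pair is lost at merging.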
Your default paired-end run actually comes up with a reasonable number of reads, with your lowest sample having over 3000 reads. But if you really need more reads for your analysis, I would suggest you stick with just the forward reads. You retain a lot more reads, but you lose a little resolution in the trade. Depending on your study, this may not be a big issue at all. In fact, given that your forward reads are in good shape, you can retain most of them at, say, 240 bp, and this isn’t a huge loss in resolution compared to 450 bp (see Fig. 1 of Wang et al. 2007).
Hope this clarifies some of your questions.
