Hi @Jo_mee,
If I may qiime in here for a moment.
It's great to see users diving so deep into this stuff because it really does highlight how much expertise goes into these analyses and how specific each analysis can be. And thanks for sharing your results and searching around on the forum as well.
I'll start with the conclusion, which is that what you are seeing is perfectly normal and nothing unexpected is happening.
The minimum Phred score of 20 that you mentioned was indeed an unofficial standard in the field when we were working with OTU clustering methods. Before denoising methods such as DADA2/Deblur/UNOISE came out, that threshold was one way of ensuring we weren't introducing too much error by letting low-quality reads through. These newer methods, however, attempt to correct base-calls and so do not rely on a minimum Phred-score threshold, within reason. This is why, for example, DADA2 has a default truncQ of 2, and I've personally never needed to change it. In fact, if you increase that value to 10-20, as you have done in some of your simulations, it will discard too many reads that could simply have been corrected and used. In the case of DADA2, where an error model is built first, it may even prevent that model from being built properly. So overall, unless you have a very specific reason to do otherwise, I would recommend leaving it at the default.
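For some intuition on what those Phred values actually mean, here is a quick sketch (plain Python, nothing QIIME-specific) converting Phred scores to the expected per-base error rate:

```python
# A Phred score Q encodes the probability p that a base call is wrong:
#   Q = -10 * log10(p)   =>   p = 10 ** (-Q / 10)

def phred_to_error_prob(q: float) -> float:
    """Expected probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

for q in (2, 10, 20, 30):
    print(f"Q{q}: {phred_to_error_prob(q):.1%} chance the base call is wrong")
# Q2: 63.1% chance the base call is wrong
# Q10: 10.0% chance the base call is wrong
# Q20: 1.0% chance the base call is wrong
# Q30: 0.1% chance the base call is wrong
```

Seen this way, Q20 means roughly 1 error in 100 bases, which is exactly the kind of sparse, mostly-random error a denoiser's error model can handle, while truncQ=2 only cuts at the point where base calls are close to uninformative.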
Moving on to the main cause of your losses. You mentioned that you have 2x250bp V3-V4 reads. With the most common V3-V4 primers you get a ~460bp amplicon, but a 2x250bp run gives you at most 500bp of sequence, which leaves only ~40bp of overlap. DADA2 requires a minimum of about 20bp of overlap for proper merging; otherwise it discards any read pairs (both forward and reverse) that it can't merge. Add to that the natural variation in amplicon length, meaning some true taxa need more than 20bp of overlap, and the fact that we need to truncate the poor-quality 3' tails of our reads (exactly where merging occurs). All of these contribute to failed merging, which is what you are seeing here. This is actually very common even in 2x300bp V3-V4 runs when the 3' tails are poor in quality, let alone in 2x250 runs like yours. Among my own colleagues I tend to advise against 2x250 runs for the V3-V4 region, since most of the time you end up not using the reverse reads anyway. Which brings me to my final thought.
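To make that arithmetic concrete, here's a small sketch; the ~460bp amplicon length and the ~20bp minimum-overlap figure are the numbers discussed in this thread, not universal constants, and the truncation lengths are just illustrative:

```python
def merge_overlap(fwd_len: int, rev_len: int, amplicon_len: int) -> int:
    """Bases of overlap left once forward and reverse reads span the amplicon."""
    return fwd_len + rev_len - amplicon_len

MIN_OVERLAP = 20  # approximate minimum DADA2 needs to merge a pair

# Untruncated 2x250 reads on a ~460bp V3-V4 amplicon:
print(merge_overlap(250, 250, 460))  # 40 -- merging is possible, but barely

# Truncate the poor-quality 3' tails (say fwd at 240, rev at 200):
print(merge_overlap(240, 200, 460))  # -20 -- the reads no longer reach each other

# A slightly longer-than-average amplicon eats the margin too:
print(merge_overlap(250, 250, 485))  # 15 -- below ~20bp, so the pair is discarded
```

So even modest quality-trimming or natural length variation pushes a 2x250 V3-V4 run below the merging threshold, which is why the read loss shows up at the merging step rather than at filtering.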
Your default PE run actually yields a reasonable number of reads, with your lowest sample having over 3,000. But if you really need more reads for your analysis, I would suggest sticking with just the forward reads. You retain many more reads, at the cost of a little resolution. Depending on your study, this may not be a big issue at all; given that your forward reads are in good shape, you can retain most of them by truncating at, say, 240bp, and that isn't a huge loss in resolution compared to ~450bp (see Fig. 1 of Wang et al. 2007).
Hope this clarifies some of your questions.