DADA2 denoising for V4 paired-end samples

Hi everyone!

I am working with 16S data (V4 region, 515F/806R primers) sequenced on a MiSeq, which gives paired-end reads of 250 bp.

I have the following quality plot (I have also uploaded the original file). I have been checking several sources of information, including tutorials and questions on the forum, but I'm still a bit unsure which quality criteria I should use in my case.


SI_GF_Samples.qzv (310.8 KB)

Initially, I used more "relaxed" criteria, truncating forward reads at 245 and reverse reads at 183. However, when discussing this with some colleagues, I got the suggestion to be more conservative (that is, more aggressive with the trimming), on the grounds that the overlap region should still be long enough for successful merging. So I ended up with 210 for forward reads and 150 for reverse reads.
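
As a rough check on the overlap concern, the expected overlap after truncation can be estimated as forward length plus reverse length minus amplicon length. The sketch below assumes a V4 amplicon of about 253 bp with the primers already removed from the reads (an assumption, since the post does not say; the amplicon is longer if primers are still present) and compares the result with DADA2's default minimum merge overlap of 12 nt. The function name is hypothetical.

```python
# Assumed V4 amplicon length with primers removed; adjust for your own data.
ASSUMED_AMPLICON_LEN = 253
# DADA2's default minimum overlap required to merge a read pair.
MIN_OVERLAP = 12


def expected_overlap(trunc_f, trunc_r, amplicon_len=ASSUMED_AMPLICON_LEN):
    """Estimate how many bases the truncated forward and reverse reads share."""
    return trunc_f + trunc_r - amplicon_len


# The two truncation settings mentioned in the post.
for trunc_f, trunc_r in [(245, 183), (210, 150)]:
    overlap = expected_overlap(trunc_f, trunc_r)
    status = "enough" if overlap >= MIN_OVERLAP else "too short"
    print(f"F{trunc_f}/R{trunc_r}: ~{overlap} nt overlap ({status})")
```

Under that assumption both settings leave far more than 12 nt of overlap (about 175 and 107 nt), though natural length variation in the amplicon means some margin is worth keeping.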

Can anyone help me validate either approach, or maybe share some sources of information on these issues for paired-end data? Most of the time I can only find material for single-end data.

Thanks in advance,

André

Hello @asbarros,

There's no harm in trying both (and others) to see which gives the best results. I usually recommend truncating at the first position where the first quartile touches a quality score of 20. That seems to be around forward 218, reverse 150 for you. Of course, overlap has to be taken into account, as you mentioned.
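
To make the rule of thumb concrete, here is a minimal sketch of it, assuming you have per-read Phred scores as equal-length lists. The toy reads and the function name are made up for illustration; in practice you would read the 25th percentile per position off the interactive quality plot.

```python
import statistics


def first_position_q1_at_or_below(reads, threshold=20):
    """Return the first 1-based position whose first quartile is <= threshold.

    `reads` is a list of equal-length lists of Phred scores, one list per read.
    Returns None if the first quartile never drops to the threshold.
    """
    # zip(*reads) yields one tuple of scores per read position.
    for position, column in enumerate(zip(*reads), start=1):
        q1 = statistics.quantiles(column, n=4, method="inclusive")[0]
        if q1 <= threshold:
            return position
    return None


# Hypothetical toy data: four reads, five positions each.
toy_reads = [
    [38, 37, 30, 22, 18],
    [38, 36, 28, 20, 15],
    [37, 35, 29, 19, 12],
    [38, 37, 31, 24, 20],
]
print(first_position_q1_at_or_below(toy_reads))
```

The position it returns is only a starting candidate for the truncation length; the overlap still has to be checked separately.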

Hi @colinvwood ! Thanks so much for your answer.

Can I ask whether there is any reference for that cutoff of 20? I have seen some people referring to it, but I have never been able to find the source.

In addition, is there any discussion/post/reference on the effects of relaxed vs strict trimming that you know about?

Thanks once again.

André

Hello @asbarros,

Can I ask whether there is any reference for that cutoff of 20? I have seen some people referring to it, but I have never been able to find the source.

No reference that I know of; it's just a rule of thumb. Any cutoff is arbitrary because quality scores map onto a continuous probability that a base was miscalled.
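
For context on why there is no natural threshold, the standard Phred definition relates a quality score Q to an error probability P by P = 10^(-Q/10). A small sketch (the function name is just illustrative):

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability the base is miscalled."""
    return 10 ** (-q / 10)


# Error probability falls smoothly as Q rises; nothing special happens at 20.
for q in (10, 20, 30, 40):
    print(f"Q{q}: {phred_to_error_probability(q):.4%} chance of a miscall")
```

So Q20 corresponds to a 1 in 100 chance of a wrong base and Q30 to 1 in 1000, which is why 20 is a convenient round number rather than a hard boundary.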

In addition, is there any discussion/post/reference on the effects of relaxed vs strict trimming that you know about?

There are plenty of such discussions; just search this forum for them.

That said, these discussions will be less informative than empirically testing what performs best on your data.

Hey @colinvwood. I was reviewing the plots and, for the forward reads, the first quartile reaches Q20 in the 190s rather than at 218. However, the bases after that position have higher quality scores again.

The question is: should I keep the cut at 218 or be more stringent? If I am more relaxed, I will include only one problematic position but will increase the overlap. Being stringent means following the rule to the letter.

Thanks once again

Hello @asbarros,

Since it's just one dip to around 20, you can see what happens when you ignore that position. Try not ignoring it as well, of course, and see which works better.
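
One way to express "ignore an isolated dip" is to require the first quartile to stay at or below the threshold for several consecutive positions before calling it a real drop. This is only a sketch under that assumption; the `min_run` parameter, the function name, and the toy values are hypothetical and not part of DADA2 or QIIME 2.

```python
def first_sustained_drop(q1_by_position, threshold=20, min_run=2):
    """Return the 1-based start of the first run of at least `min_run`
    consecutive positions whose first quartile is <= threshold, else None."""
    run_start = None
    run_len = 0
    for position, q1 in enumerate(q1_by_position, start=1):
        if q1 <= threshold:
            if run_len == 0:
                run_start = position  # a new low-quality run begins here
            run_len += 1
            if run_len >= min_run:
                return run_start
        else:
            run_len = 0  # quality recovered, so the dip was isolated
    return None


# Hypothetical first-quartile values: a single dip at position 3, then a
# sustained drop starting at position 6.
toy_q1 = [35, 34, 20, 33, 32, 20, 19, 18]
print(first_sustained_drop(toy_q1, min_run=1))  # strict: stop at the first dip
print(first_sustained_drop(toy_q1, min_run=2))  # tolerant: skip isolated dips
```

Running DADA2 with both candidate positions and comparing the outcome is still the deciding step.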

Considering I have no positive controls, which criteria should I compare between the approaches to decide which is best?

Thanks

Hello @asbarros,

The percentage of sequences in each sample that are retained after DADA2.

So I have run both, and the more stringent criterion gives an improvement of ~3% over the relaxed one. Is this considered a major improvement?

SI_GF_F195-Stats.qzv (1.2 MB)
SI_GF_F218-Stats.qzv (1.2 MB)

Hello @asbarros,

Maybe not major, but still significant. It looks like your samples start at around 50k reads, so retaining 3% more of them means you have 1500 more reads to analyze downstream.
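
The comparison can be sketched as follows, assuming you take the input and final (non-chimeric) read counts per sample from each run's denoising stats. The counts below are invented purely to mirror the arithmetic above and are not taken from the attached files.

```python
def percent_retained(input_reads, non_chimeric_reads):
    """Percentage of a sample's input reads that survive the whole pipeline."""
    return 100 * non_chimeric_reads / input_reads


# Hypothetical counts for one sample under two truncation settings.
input_reads = 50_000
relaxed = percent_retained(input_reads, 40_000)
stringent = percent_retained(input_reads, 41_500)

# Convert the percentage-point gain back into reads for this sample.
gain_points = stringent - relaxed
extra_reads = input_reads * gain_points / 100
print(f"relaxed: {relaxed:.1f}%, stringent: {stringent:.1f}%")
print(f"gain: {gain_points:.1f} points, about {extra_reads:.0f} extra reads")
```

Doing this per sample, rather than only on the average, shows whether the gain is consistent or driven by a few samples.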
