I am working with 16S data (V4 region, 515F/806R primers) sequenced on a MiSeq, giving paired-end reads of 250 bp.
I have the following quality plot (I have also uploaded the original file). I have been checking several sources of information, including tutorials and other questions on the forum, but I'm still a bit unsure which quality criteria I should use in my case.
Initially, I used more "relaxed" criteria, truncating forward reads at 245 and reverse reads at 183. However, after discussing this with some colleagues, I got the suggestion to be more conservative (that is, more aggressive) with the trimming, on the grounds that the remaining overlap region should still be enough for successful merging. So I ended up with 210 for forward reads and 150 for reverse reads.
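As a rough sanity check on the overlap argument, the expected overlap for each pair of truncation lengths can be estimated with simple arithmetic. This is only a sketch: the amplicon length of 253 bp (a commonly quoted V4 length once primers are removed) and the minimum overlap of 12 (the default in DADA2's merging step) are assumptions, not values from this thread, and `expected_overlap` is a hypothetical helper name.

```python
def expected_overlap(trunc_len_f, trunc_len_r, amplicon_len):
    """Number of bases shared by the truncated forward and reverse reads."""
    return trunc_len_f + trunc_len_r - amplicon_len


AMPLICON_LEN = 253  # assumption: typical V4 length with primers already removed
MIN_OVERLAP = 12    # assumption: default minimum overlap used when merging

settings = {"relaxed": (245, 183), "conservative": (210, 150)}
for name, (trunc_f, trunc_r) in settings.items():
    overlap = expected_overlap(trunc_f, trunc_r, AMPLICON_LEN)
    print(f"{name}: overlap = {overlap}, enough = {overlap >= MIN_OVERLAP}")
```

If the primers are still on the reads, or the amplicons vary in length, the amplicon length should be raised accordingly and some margin left above the minimum.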
Can anyone help me validate either approach, or maybe share some sources of information on these issues for paired-end data? Most of the time, I can only find material for single-end data.
There's no harm in trying both (and others) to see which gives the best results. I usually recommend trimming at the first position where the first quartile touches a quality score of 20. That seems to be around 218 for forward and 150 for reverse in your case. Of course, overlap has to be taken into account, as you mentioned.
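The rule of thumb above can be written down as a small function, which may help when reading positions off the plot gets tedious. This is a minimal sketch, assuming the per-position quality scores are already available as plain lists; `truncation_position` and the toy data are hypothetical, and whether to truncate at the returned position or one base before it is a judgment call.

```python
from statistics import quantiles


def first_quartile(scores):
    """25th percentile of a list of quality scores (needs at least 2 values)."""
    return quantiles(scores, n=4, method="inclusive")[0]


def truncation_position(quality_by_position, threshold=20):
    """Return the 1-based position where the first quartile first touches
    the threshold, or None if it never does.

    quality_by_position[i] holds the scores observed at position i + 1.
    """
    for pos, scores in enumerate(quality_by_position, start=1):
        if first_quartile(scores) <= threshold:
            return pos
    return None


# Toy data for illustration only: four positions, four reads each.
toy = [
    [38, 38, 37, 36],
    [37, 36, 35, 30],
    [30, 25, 20, 18],
    [20, 15, 12, 10],
]
print(truncation_position(toy))  # 3
```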
Can I ask whether there is any reference for that cutoff of 20? I have seen some people referring to it, but I have never been able to find the source.
In addition, is there any discussion/post/reference on the effects of relaxed vs strict trimming that you know about?
No reference that I know of for the cutoff of 20; it's just a rule of thumb. Cutoffs are arbitrary because quality scores are a continuous measure of the probability that a base was miscalled.
As for relaxed vs. strict trimming, there are plenty of such discussions; just search this forum for them.
These discussions are going to be inferior to empirically testing what performs best for your data.
Hey @colinvwood. I was reviewing the plots and, for the forward reads, the first quartile actually reaches Q20 around position 190 rather than 218. However, the bases after 190 have higher quality scores.
The question is whether I should keep the cut at 218 or be more stringent. If I am more relaxed, I will include only one problematic position but will increase the overlap. Being stringent means following the rule to the letter.
Since it's just one dip to around 20, you can see what happens when you ignore that position. Of course, try not ignoring it as well, and see which works better.
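One way to make "ignore a single dip" explicit is to require several consecutive low positions before cutting. The sketch below assumes the first-quartile values have already been read off the plot into a list; `truncation_with_tolerance`, its `min_run` parameter, and the toy values are hypothetical and only illustrate the idea.

```python
def truncation_with_tolerance(q1_by_position, threshold=20, min_run=2):
    """Return the 1-based start of the first run of at least `min_run`
    consecutive positions whose first quartile is at or below `threshold`.
    Returns None if no such run exists.
    """
    run = 0
    for pos, q1 in enumerate(q1_by_position, start=1):
        if q1 <= threshold:
            run += 1
            if run >= min_run:
                return pos - min_run + 1
        else:
            run = 0  # an isolated dip is forgotten once quality recovers
    return None


# Toy first-quartile values: one isolated dip at position 3,
# then a sustained drop starting at position 6.
toy_q1 = [35, 34, 20, 33, 32, 20, 19, 18]
print(truncation_with_tolerance(toy_q1, min_run=1))  # 3 (rule to the letter)
print(truncation_with_tolerance(toy_q1, min_run=2))  # 6 (dip ignored)
```

Running the denoising step with both resulting lengths and comparing the outcomes is still the real test.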
Maybe not major, but still significant. It looks like your samples start at around 50k reads, so retaining 3% more of them means you have 1500 more reads to analyze downstream.
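The arithmetic behind that estimate is easy to reuse when comparing two trimming settings. This is just a sketch with hypothetical helper names; the 50,000 reads and the 3 percentage points are the figures mentioned above, not measured values.

```python
def percent_retained(input_reads, output_reads):
    """Percentage of input reads that survive filtering and merging."""
    return 100.0 * output_reads / input_reads


def extra_reads(input_reads, gain_pct_points):
    """Additional reads kept when retention improves by `gain_pct_points`."""
    return round(input_reads * gain_pct_points / 100)


print(extra_reads(50_000, 3))  # 1500
```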