This may be a simple question but after much searching I am unable to find the answer so I'm hoping someone can help me.
Why do the DADA2 default settings have the PHRED score set to 2? This seems far too low, and as far as I understand it means there would be a higher chance of the base being called incorrectly than correctly.
Despite this, I rarely see anybody alter the default quality cutoff, only the truncation length and trim parameters.
If anyone can enlighten me that would be most appreciated.
Welcome to the forum, and that's an excellent question! Agreed: with older methods such as OTU clustering, a Phred score of 2 would have been risky at best. However, unlike those methods, where the Phred score was simply used as a cutoff threshold, DADA2 infers and corrects reads based not just on the quality scores but also on the abundance of each feature, both of which are incorporated into its error model. This allows it to make highly accurate calls even where a Phred score is poor, assuming that other copies of that read exist with higher quality scores. In addition, keep in mind that with DADA2 we still exclude low-quality nucleotides by trimming low-quality tails using --p-trim and --p-trunc, and there we do use more familiar Q-score thresholds, like those used in quality-score-based filtering prior to OTU clustering. For example, I generally truncate my sequences at the point where the median quality score seems to drop below 20-25.
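For reference, the arithmetic behind the original worry checks out: a Phred score Q corresponds to an error probability of 10^(-Q/10), so Q2 means the base is more likely wrong than right. A minimal Python sketch of that conversion (the function name is my own and is not part of DADA2 or QIIME 2):

```python
def phred_to_error_prob(q: float) -> float:
    """Probability that a base call is wrong, given its Phred score Q."""
    return 10 ** (-q / 10)


# Compare the DADA2 default (Q2) with more familiar thresholds.
for q in (2, 10, 20, 30):
    print(f"Q{q}: P(error) = {phred_to_error_prob(q):.4f}")
```

Q2 comes out at roughly 0.63, while Q20 is 0.01, which is why Q20 or so was a common hard cutoff in older pipelines.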
As for why the default score is 2 in DADA2, this appears to have had some special consideration, as per this discussion here.
All that said, you are of course encouraged to play around with these parameters to find what works best for you. You could try increasing the default score and see how it compares. I would certainly be interested in that comparison, so if you do, feel free to share your results here! My guess is you would see some minor differences, with higher Q-score values giving you fewer reads and slightly lower alpha diversity.
Thanks for getting back to me so quickly and with a comprehensive answer.
I've actually already run my samples at different quality scores. Here's a table with consistent trunc and trim parameters and a varying PHRED score, as well as my quality data following cutadapt removal of adapters.
There's no change up to Q5, but looking at the quality score graph there appears to be some kind of cutoff at that point anyway. There is a significant loss of reads, though, as soon as Q7, which I'm struggling to understand: looking at the graph and the trunc parameters, practically no reads should be affected by this?
My amplicons should be around 411bp using the 515F-926R EMP primers.
Thanks for sharing this data! I think your results are fairly plausible and nicely support the default parameter. As I understand it, DADA2 truncates each read at the position where the first instance of truncq=n occurs; then, at the next step, it discards any reads that are shorter than your truncation length, so it's likely you are losing reads at this step. Also remember that those plots are created from a subset of only 10,000 reads, not from all of your reads (it would be nice if we could change that value!), so it is entirely possible that your loss in reads is happening because there are low Q values in the millions of other reads not shown.
Ooh! Excuse me then, I must have missed the update on that. Awesome! And yeah a shortcut for ‘all’ would be pretty convenient.
The above plot was still produced using the default 10,000 reads, though, so this doesn't change the point I mentioned earlier. Perhaps re-running the plot with all the reads can confirm our suspicion.
The quality truncation at Q-score 2 is really just for older Illumina software, where a score of 2 was code for "I don't know what's going on anymore" and any bases after the first 2 were often poor. These days it's basically superfluous in most cases, and I'd recommend using max-ee as the quality filter in almost all cases, in conjunction with trunc-len to truncate off low-quality sequence tails.
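To make the max-ee idea concrete: the expected number of errors in a read is the sum of the per-base error probabilities, 10^(-Q/10), and a read is dropped when that sum exceeds the threshold. Below is a rough Python sketch of that calculation. The function names are my own, and the default of 2.0 is an assumption about the usual setting, so check the value in your own command.

```python
def expected_errors(quals):
    """Sum of per-base error probabilities for one read's Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def passes_max_ee(quals, max_ee=2.0):
    """True if the read has no more than max_ee expected errors.

    max_ee=2.0 is an assumed default for illustration only.
    """
    return expected_errors(quals) <= max_ee


# Hypothetical reads: one clean, one with a poor 50-base tail.
clean_read = [30] * 250
poor_tail = [30] * 200 + [10] * 50

print(f"{expected_errors(clean_read):.2f}", passes_max_ee(clean_read))
print(f"{expected_errors(poor_tail):.2f}", passes_max_ee(poor_tail))
```

The clean read has about 0.25 expected errors and passes, while the read with the Q10 tail has about 5.2 and fails. Truncating that tail off with trunc-len first would bring it back under the threshold, which is why the two parameters work well together.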
trunc-q will truncate the sequence at the first instance of quality score X. It acts on both the forward and reverse sequences independently. Then trunc-len acts, and will truncate any sequences that are longer than trunc-len to that value, and throw away any sequences that are shorter than trunc-len. So I think the behavior above is explained by some reads being truncated earlier than trunc-len, mostly in the reverse reads, as trunc-q is increased, and therefore being discarded.
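The two steps can be sketched in a few lines of Python. This is only an illustration of the logic described above, not DADA2's actual code: the function name is made up, it assumes trunc_len is greater than 0, and it follows the DADA2 documentation in cutting at the first score less than or equal to trunc-q.

```python
def truncate_and_filter(quals, trunc_q, trunc_len):
    """Return the retained quality scores, or None if the read is discarded.

    Illustrative only; assumes trunc_len > 0.
    """
    # Step 1 (trunc-q): cut the read at the first base with quality <= trunc_q.
    for i, q in enumerate(quals):
        if q <= trunc_q:
            quals = quals[:i]
            break
    # Step 2 (trunc-len): discard reads that are now too short,
    # and cut longer reads down to trunc_len.
    if len(quals) < trunc_len:
        return None
    return quals[:trunc_len]


# Hypothetical 300-base read with a single Q6 base at position 200.
read = [30] * 200 + [6] + [30] * 99

kept = truncate_and_filter(read, trunc_q=2, trunc_len=220)
print(len(kept))  # 220: no base is <= Q2, so only trunc-len applies

print(truncate_and_filter(read, trunc_q=7, trunc_len=220))  # None: cut to 200 bases, then discarded
```

A single low-quality base before the trunc-len position is enough to lose the whole read once trunc-q is raised past its score, which matches the drop in read counts seen above.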
I found this topic and felt that you could help me regarding the quality of my data.
I have sequences from an Illumina MiSeq run (a 16S analysis of the V3-V4 region). I received the sequences demultiplexed and imported them into QIIME 2. I then loaded the demux.qzv file in QIIME 2 View, and this is my result:
So I'm confused about where I have to truncate and trim. The middle of my sequences has very low quality, which is unusual because normally we see the low quality at the end of the sequences. How can I deal with it?
I have split your question off into a new post, which can be found here. Friendly reminder that if you are reading an old post that is related to your question, please always create a new post so that we can accurately track each individual question that may come up. Thanks!