Why do the DADA2 default setting have such a low PHRED score?

Hello,

This may be a simple question but after much searching I am unable to find the answer so I’m hoping someone can help me.

Why does the DADA2 default settings have the PHRED score set to 2? This seems far too low and as far as I understand it mean there would be a higher chance of the base being called in correctly than correctly?

Despite this I rarely see any body alter the default quality cut off? Only the truncation length and trim parameters.

If anyone can enlighten me that would be most appreciated.

Cheers,

Sam

1 Like

Hi @SamWilliamsUOB,
Welcome to the forum, and that’s an excellent question! Agreed, that using older methods such as OTU clustering with a phred score of 2 would have been risky at best, however unlike previous methods where the phred score was simply used as a cutoff threshold, dada2 is inferring and correcting reads based not just on the quality scores but also the abundance of a feature, which are incorporated into its error-model. This allows it to infer highly accurate calls even in instances where a phred score is not great, assuming that other copies of that read exist with higher quality scores. In addition, keep in mind that with dada2 we are still excluding low quality nts by trimming low quality tails using --p-trim and –p-trunc and in these situations we do use more familiar q-score threshold such as those we saw with quality score based filtering methods prior to OTU clustering. For example, I generally trim all my sequences to instances where the median quality score seems to drop below 20-25.
As for why the default socre is 2 in dada2, this appears to have had some special consideration as per this discussion here.
All that said, you are of course encouraged to play around with these parameters to select what works best for you, you could try increasing the default score and see how it compares. I would certainly be interested in that comparison so if you do feel free to share your results here! My guess is you would see some minor differences, with higher qscore value giving you less reads and slightly less alpha diversity.

1 Like

Hi @Mehrbod_Estaki,

Thanks for getting back to me so quickly and with a comprehensive answer.

I’ve actually already ran my samples at different quality scores - here’s a table of consistent trunc and trim parameters with a varying PHRED score as well as my quality data following cutadapt removal of adaptors.

There’s no change up to Q5 but then looking at the quality score graph there appears to be some kind of cut off at that point anyway? There is a significant loss of reads though as soon as the even Q7 which I’m struggling to understand as looking at the graph practically no reads should be affected by this looking at the trunc parameters?

My amplicons should be around 411bp using the 515F-926R EMP primers.

I’d be interested to hear your take on this

File name Trunc-forward Trunc-reverse PHRED Trim Input Filtered Denoised Merged Non-Chimeric Reads Lost Percentage Lost
SMP-truncr-172-truncf-279-trim10-DADA2_denoising-stats.qzv 279 172 2 10 404571 332085 332085 309004 279402 125169 30.94%
SMP-truncr-172-truncf-279-trim10-P5-DADA2_denoising-stats.qzv 279 172 5 10 404571 332085 332085 309004 279402 125169 30.94%
SMP-truncr-172-truncf-279-trim10-P7-DADA2_denoising-stats.qzv 279 172 7 10 404571 302497 302497 275163 117984 286586 70.84%
SMP-truncr-172-truncf-279-trim10-P10-DADA2_denoising-stats.qzv 279 172 10 10 404571 144897 144897 134545 123017 281554 69.59%
SMP-truncr-172-truncf-279-trim10-P20-DADA2_denoising-stats.qzv 279 172 20 10 404571 51222 51222 47360 43657 360914 89.21%

SMP_pooled_rev_NEX-515_fwd_NEX 926.qzv (297.5 KB)

Thanks,

Sam

1 Like

@SamWilliamsUOB, can I just say that your table is absolutely wonderful!

It is also kind of surprising, @benjjneb are we misunderstanding how --p-trunc-q / truncQ works?

2 Likes

Hi @SamWilliamsUOB,
Thanks for sharing this data! I think your results are fairly plausible and nicely support the default parameter. From how I understand, DADA2 will truncate the reads at the position where the first instance of truncq=n occurs then at the next step it will discard any whole reads that are shorter than your truncate parameters so its likely you are losing these reads at this step. Also remember that those plots are created by a subset of 10,000 reads only and not from all of your reads (it would be nice if we could change that value!), so it is entirely possible that your loss in reads is happening because there are low q values in the millions of other reads not shown.

1 Like

Behold! --p-n (I don’t think there is a shortcut for “all sequences” however.)

1 Like

Ooh! Excuse me then, I must have missed the update on that. Awesome! :partying_face: And yeah a shortcut for ‘all’ would be pretty convenient.
Though the above plot was still produced using the default 10,000 reads so this doesn’t change the point I mentioned earlier, but perhaps re-running the plot with all the reads can confirm our suspicion.

Hi Sam, this is because the primary and recommended quality filtering parameter in DADA2 is “expected errors” (max-ee), which is based on the quality scores but is a better filter than averaging raw quality scores. You can read more about EE filtering here: https://doi.org/10.1093/bioinformatics/btv401

The quality truncation at q-score 2 is really just for older Illumina software where a score of 2 was code for “I don’t know what’s going on anymore” and any bases after the first 2 often were poor. These days, it’s basically superfluous in most cases, and I’d recommend using max-ee as the quality filter in almost all cases, in conjunction with trunc-len to truncate off low quality suquence tails.

3 Likes

trunc-q will truncate the sequence at the first instance of quality score X. It acts on both the forward and reverse sequences independently. Then trunc-len acts, and will truncate any sequences that are longer than trunc-len to that value, and throw away any sequences that are shorter than trunc-len. So I think the behavior above is explained by some reads being truncated earlier than trunc-len, mostly in the reverse reads, as trunc-q is increased, and therefore being discarded.

2 Likes

Cheers all - definitely appreciate the input

If I get a chance to run a larger sub-sample I’ll be sure to share those exciting quality plots!