Why do the DADA2 default setting have such a low PHRED score?

SamWilliamsUOB · June 10, 2019, 2:05pm

Hello,

This may be a simple question but after much searching I am unable to find the answer so I'm hoping someone can help me.

Why does the DADA2 default settings have the PHRED score set to 2? This seems far too low and as far as I understand it mean there would be a higher chance of the base being called in correctly than correctly?

Despite this I rarely see any body alter the default quality cut off? Only the truncation length and trim parameters.

If anyone can enlighten me that would be most appreciated.

Cheers,

Sam

Mehrbod_Estaki · June 10, 2019, 8:28pm

Hi @SamWilliamsUOB,
Welcome to the forum, and that's an excellent question! Agreed, that using older methods such as OTU clustering with a phred score of 2 would have been risky at best, however unlike previous methods where the phred score was simply used as a cutoff threshold, dada2 is inferring and correcting reads based not just on the quality scores but also the abundance of a feature, which are incorporated into its error-model. This allows it to infer highly accurate calls even in instances where a phred score is not great, assuming that other copies of that read exist with higher quality scores. In addition, keep in mind that with dada2 we are still excluding low quality nts by trimming low quality tails using --p-trim and --p-trunc and in these situations we do use more familiar q-score threshold such as those we saw with quality score based filtering methods prior to OTU clustering. For example, I generally trim all my sequences to instances where the median quality score seems to drop below 20-25.
As for why the default socre is 2 in dada2, this appears to have had some special consideration as per this discussion here.
All that said, you are of course encouraged to play around with these parameters to select what works best for you, you could try increasing the default score and see how it compares. I would certainly be interested in that comparison so if you do feel free to share your results here! My guess is you would see some minor differences, with higher qscore value giving you less reads and slightly less alpha diversity.

SamWilliamsUOB · June 11, 2019, 9:38am

Hi @Mehrbod_Estaki,

Thanks for getting back to me so quickly and with a comprehensive answer.

I've actually already ran my samples at different quality scores - here's a table of consistent trunc and trim parameters with a varying PHRED score as well as my quality data following cutadapt removal of adaptors.

There's no change up to Q5 but then looking at the quality score graph there appears to be some kind of cut off at that point anyway? There is a significant loss of reads though as soon as the even Q7 which I'm struggling to understand as looking at the graph practically no reads should be affected by this looking at the trunc parameters?

My amplicons should be around 411bp using the 515F-926R EMP primers.

I'd be interested to hear your take on this

File name	Trunc-forward	Trunc-reverse	PHRED	Trim	Input	Filtered	Denoised	Merged	Non-Chimeric	Reads Lost	Percentage Lost
SMP-truncr-172-truncf-279-trim10-DADA2_denoising-stats.qzv	279	172	2	10	404571	332085	332085	309004	279402	125169	30.94%
SMP-truncr-172-truncf-279-trim10-P5-DADA2_denoising-stats.qzv	279	172	5	10	404571	332085	332085	309004	279402	125169	30.94%
SMP-truncr-172-truncf-279-trim10-P7-DADA2_denoising-stats.qzv	279	172	7	10	404571	302497	302497	275163	117984	286586	70.84%
SMP-truncr-172-truncf-279-trim10-P10-DADA2_denoising-stats.qzv	279	172	10	10	404571	144897	144897	134545	123017	281554	69.59%
SMP-truncr-172-truncf-279-trim10-P20-DADA2_denoising-stats.qzv	279	172	20	10	404571	51222	51222	47360	43657	360914	89.21%

SMP_pooled_rev_NEX-515_fwd_NEX 926.qzv (297.5 KB)

Thanks,

Sam

ebolyen · June 11, 2019, 5:42pm

@SamWilliamsUOB, can I just say that your table is absolutely wonderful!

It is also kind of surprising, @benjjneb are we misunderstanding how --p-trunc-q / truncQ works?

Mehrbod_Estaki · June 11, 2019, 6:47pm

Hi @SamWilliamsUOB,
Thanks for sharing this data! I think your results are fairly plausible and nicely support the default parameter. From how I understand, DADA2 will truncate the reads at the position where the first instance of truncq=n occurs then at the next step it will discard any whole reads that are shorter than your truncate parameters so its likely you are losing these reads at this step. Also remember that those plots are created by a subset of 10,000 reads only and not from all of your reads (it would be nice if we could change that value!), so it is entirely possible that your loss in reads is happening because there are low q values in the millions of other reads not shown.

ebolyen · June 11, 2019, 6:56pm

Behold! --p-n (I don't think there is a shortcut for "all sequences" however.)

Mehrbod_Estaki · June 11, 2019, 7:21pm

Ooh! Excuse me then, I must have missed the update on that. Awesome! And yeah a shortcut for 'all' would be pretty convenient.
Though the above plot was still produced using the default 10,000 reads so this doesn't change the point I mentioned earlier, but perhaps re-running the plot with all the reads can confirm our suspicion.

benjjneb · June 12, 2019, 12:28pm

Hi Sam, this is because the primary and recommended quality filtering parameter in DADA2 is "expected errors" (max-ee), which is based on the quality scores but is a better filter than averaging raw quality scores. You can read more about EE filtering here: https://doi.org/10.1093/bioinformatics/btv401

The quality truncation at q-score 2 is really just for older Illumina software where a score of 2 was code for "I don't know what's going on anymore" and any bases after the first 2 often were poor. These days, it's basically superfluous in most cases, and I'd recommend using max-ee as the quality filter in almost all cases, in conjunction with trunc-len to truncate off low quality suquence tails.

benjjneb · June 12, 2019, 12:31pm

trunc-q will truncate the sequence at the first instance of quality score X. It acts on both the forward and reverse sequences independently. Then trunc-len acts, and will truncate any sequences that are longer than trunc-len to that value, and throw away any sequences that are shorter than trunc-len. So I think the behavior above is explained by some reads being truncated earlier than trunc-len, mostly in the reverse reads, as trunc-q is increased, and therefore being discarded.

SamWilliamsUOB · June 14, 2019, 1:09pm

Cheers all - definitely appreciate the input

If I get a chance to run a larger sub-sample I'll be sure to share those exciting quality plots!

joaomiranda · August 11, 2021, 12:27pm

Hi everyone

I found this topic and felt that you could help me regarding the quality of my data.
I have sequences from am Ilumina Miseq run (Using a 16s analysis in the V3-V4 region), I receive the sequences demultiplexed and import them in qiime2. So I got the demux.qzv file and put in the qiime view, and this is my result:

So I'm confused where I have to truncate and trim? The middle of my sequences are in a too low quality, this is unusual because normally we see the low quality in the final of the sequences, how can I deal with it?

Thanks in advance

lizgehret · August 13, 2021, 10:32pm

Hi @joaomiranda,

I have split your question off into a new post, which can be found here. Friendly reminder that if you are reading an old post that is related to your question, please always create a new post so that we can accurately track each individual question that may come up. Thanks!

Cheers,
Liz