DADA2 filtering out everything

Hello,

I'm trying to use DADA2 to quality filter reads for a new dataset and am having trouble getting reads to pass the filtering step. I have 473 paired-end samples sequenced on an Illumina NextSeq 2000.

I first trimmed the reads using cutadapt and got these interactive quality plots:

[interactive quality score plots]

These looked weird, but I found this post explaining that the odd appearance is likely due to binned quality scores and is nothing to worry about.

I'm trying to use truncation cutoffs of 300 for the forward reads and 290 for the reverse (this should give me 126 base pairs of overlap; we sequenced the V3/V4 region, so I'm expecting an insert size of 464). Another post (that I can't find again, sorry!) mentioned that the issue for that poster was not having enough reads matching their cutoff lengths. Looking at my "Demultiplexed sequence length summary" output from cutadapt (below), I don't think that's my issue. If I'm interpreting the table correctly, 98% of my trimmed reads are still 301 bases long, yes?
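(Checking the overlap arithmetic: 300 forward + 290 reverse = 590 sequenced bases, and 590 − 464 bp amplicon = 126 bp of overlap, comfortably above the 12 bp minimum that DADA2 needs to merge pairs.)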

[table: Demultiplexed sequence length summary]

Assuming I trimmed correctly, I used this command to run DADA2:
qiime dada2 denoise-paired --i-demultiplexed-seqs cutadapt_out/trimmed_sequences.qza \
--p-trunc-len-f 300 --p-trunc-len-r 290 --p-n-threads 9 \
--output-dir dada2_out_trimmed --verbose

The resulting denoising summary is awful: the highest percentage of reads retained after filtering is only 0.51%.

I'm at a loss as to what I'm doing wrong in getting these reads QC'd. Any advice or thoughts would be greatly appreciated!

Thank you,
Samantha

Hi @saatkinson,
Have you tried playing around with the trunc parameters?

I suspect that your drop in quality around position 280 might be causing this. It seems like you are losing a lot of your reads during the filtering step, which indicates to me that it's most likely a quality issue.

I would recommend truncating your reverse reads at around 280. Let us know the results!
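That would just be your original command with the reverse truncation lowered, something like this (output dir renamed so it doesn't collide with your first run):

qiime dada2 denoise-paired --i-demultiplexed-seqs cutadapt_out/trimmed_sequences.qza \
--p-trunc-len-f 300 --p-trunc-len-r 280 --p-n-threads 9 \
--output-dir dada2_out_trunc280 --verbose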

2 Likes

Hi @cherman2,

I reran DADA2 with the reverse cutoff of 280 and it only barely improved the percentage passing the filter (now 0.67% instead of 0.51%).

Should I keep playing with cutoff values? I'm not sure how to scientifically pick a value for the forward reads, since their plot is all whisker lines (no blue bars like the reverse reads have).

Thanks,
Samantha

1 Like

Hello!

Did you provide any cutoff for the forward reads?

In my experience with NextSeq data and similar quality plots, playing with the --p-max-ee-r and --p-max-ee-f parameters can improve the output.

As an alternative, one may also consider merging reads with VSEARCH and then denoising with Deblur.

Best,

1 Like

Hi!

Yes, I used the same forward read cutoff (300) as in my original run.

I've never used these parameters before and don't really understand what they do from the dada2 denoise-paired --help menu. Could you point me toward an explanation of, or instructions on, how to adjust the max-ee-f/r values intelligently (for lack of a better phrase)?

Do you think the VSEARCH + Deblur approach would be better than changing the --p-max-ee-f/r parameters?

Thanks so much!
Samantha

1 Like

It could be the reason! Try disabling forward truncation by setting --p-trunc-len-f to 0, or try another cutoff smaller than 300, for example 290 or 295. BTW, did you remove primers?

It is the number of possible errors you are allowing in your reads. I would guess that setting it to 300 would allow 300 errors... I would just play with it, gradually increasing the values to see if there is any effect.
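For reference, this is the standard expected-error definition (not QIIME-specific): a read's expected error is the sum of its per-base error probabilities, EE = sum of 10^(-Q/10) over all bases. So one Q20 base contributes 0.01, and a 300-base read that is entirely Q20 has EE = 300 × 0.01 = 3.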

Usually I go for it when nothing works with DADA2. I am not aware of any studies comparing the two methods, but a couple of times Deblur has saved the data.
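If you want to try that route, it looks roughly like this (a sketch only; the trim length is a placeholder and option names differ a bit between QIIME 2 versions, so check --help at each step):

qiime vsearch merge-pairs \
--i-demultiplexed-seqs cutadapt_out/trimmed_sequences.qza \
--o-merged-sequences merged.qza

qiime quality-filter q-score \
--i-demux merged.qza \
--o-filtered-sequences merged_filtered.qza \
--o-filter-stats filter_stats.qza

qiime deblur denoise-16S \
--i-demultiplexed-seqs merged_filtered.qza \
--p-trim-length 400 \
--o-representative-sequences rep_seqs.qza \
--o-table table.qza \
--o-stats deblur_stats.qza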

2 Likes

Afternoon everyone,

EE == Expected Error comes from this paper: Edgar & Flyvbjerg (2015), "Error filtering, pair assembly and error correction for next-generation sequencing reads," Bioinformatics 31(21).

This is a better idea than the average q-score, which is what we used to do :grimacing:
https://www.drive5.com/usearch/manual/avgq.html

I think I figured it out:
Those binned quality scores from the NextSeq are funky.

RTA3 assigns 1 of 4 Q-scores to each base call:
0, 12, 26, or 34 for runs of 300 cycles or less;
0, 9, 20, or 34 for runs of 600 cycles.

(Q34 = 0.0004 error rate) × 600 bases = 0.24 expected errors :+1: okay
(Q20 = 0.01 error rate) × 600 bases = 6 expected errors :fire: :fire_extinguisher: :ambulance:

With an expected error filter of 1, the full read is removed if more than 1/6 of your q-scores are Q20.
And your R2 is half Q20
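(Putting numbers on it: each Q20 base adds 0.01 expected errors, so it only takes 1 / 0.01 = 100 Q20 bases to hit a maxEE of 1, while the Q34 bases barely register at 0.0004 apiece.)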

yoqfo

--p-max-ee-r 3 --p-max-ee-f 3
:face_with_peeking_eye:

2 Likes

Hi all,

Thank you all so much for your help on this!

Yes, I used cutadapt to trim the reads and remove primers.

The max-ee information/NextSeq binning information was especially helpful! I think I'm starting to understand how this works.

So setting max-ee to 3 means that the full read is removed if more than 1/2 (3/6) of the q-scores of that read are Q20?

I've run a few iterations to test. Increasing max-ee-f and -r to 3 alone wasn't enough to help, so I added in timanix's suggestion of lowering the forward cutoff to 290. Combining the two ideas has gotten the percentage of input passing the filter up to single-digit percentages (the highest is 2.5%, which is a huge improvement over 0.51% :upside_down_face:)!

Here's my DADA2 command in case I'm doing something wrong:

qiime dada2 denoise-paired --i-demultiplexed-seqs cutadapt_out/trimmed_sequences.qza \
--p-trunc-len-f 290 --p-trunc-len-r 280 --p-n-threads 9 \
--p-max-ee-f 3 --p-max-ee-r 3 \
--output-dir dada2_out_trimmed --verbose

If I set max-ee to 4, will that allow q-scores less than 20 to pass the filter? Or is it strictly 2/3 (4/6) of the q-scores being Q20? I don't mind allowing Q20 to pass, but I don't want to drop much farther than that quality-wise.
Edit: I tried this and it only increased the percent passing the filter by 1%.

Going forward, do you think it's wise to keep changing the max-ee parameters and forward/reverse cutoffs, or is it time to follow timanix's other suggestion of using VSEARCH + Deblur?

Thanks,
Samantha

P.S. What does yoqfo mean :sweat_smile:?

1 Like

Expected Error is all about cumulative error across the full read, so a higher EE threshold means reads with more Q20 bases will be able to pass. You can calculate the exact number.
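(Worked out with the binned rates above: maxEE = 3 gives a budget of 3 / 0.01 = 300 Q20 bases, so even a 300-base read that is entirely Q20 can pass. Going to 4 mainly buys headroom for sub-Q20 bases: a Q9 base has an error rate of 10^(-0.9) ≈ 0.126, so each one costs about as much as 13 Q20 bases.)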

I usually run DADA2 >5 times with different settings to see what works best. Trying even shorter trimming to increase merging is a good idea. Deblur is another good option!

"yoqfo" you only quality filter once
(false, obviously. Discarding reads that can't join and removing singleton features also serve as quality filters.)

1 Like

That is great progress!

One more idea:
Looks like you trimmed the primers with cutadapt but didn't discard the reads in which primers were not found. In that case, a lot of reads may actually be shorter than 290 and therefore get discarded at the filtering step.
If I am right about this, then adding the "discard untrimmed" option to cutadapt (check the docs or help for the right syntax) will let you plot quality scores with a more accurate picture of read lengths. Then you can reconsider your truncation parameters.
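For example, with the q2-cutadapt plugin the flag is --p-discard-untrimmed (the input file and primer sequences here are placeholders; substitute whatever you used in your original command):

qiime cutadapt trim-paired \
--i-demultiplexed-sequences demux.qza \
--p-front-f YOUR_FORWARD_PRIMER \
--p-front-r YOUR_REVERSE_PRIMER \
--p-discard-untrimmed \
--o-trimmed-sequences cutadapt_out/trimmed_sequences.qza \
--verbose

(In standalone cutadapt, the equivalent option is --discard-untrimmed.)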

Hopefully that will add a few more percentage points to the output!

So I would play with it a little more and then move to Deblur.

Best,

2 Likes

I didn't realize this was a thing (note to self: read more of the options on help pages)!

Running the cutadapt command with the discard-untrimmed flag, playing with the cutoffs for the resulting reads, and using a max-ee value of 3 seems to have done the trick! My percent of reads passing the filter is now 72% on the low end, and I have 62.4-73.7% passing chimera checks.

Thank you both so much for your help!!

Best,

Samantha

2 Likes