I'm trying to use DADA2 to quality filter reads for a new dataset and am having issues getting reads to pass the filtering step. I have 473 paired end samples sequenced on an Illumina NextSeq 2000.
I first trimmed the reads using cutadapt and got these interactive quality plots:
These looked weird, but I found this post that explained that the weird look was likely due to binned quality scores and that there's no reason to worry.
I'm trying to use cutoffs of 300 for forward and 290 for reverse (this should give me 126 basepairs of overlap; we sequenced the V3/V4 region so expecting an insert size of 464). Another post (that I can't find again, sorry!) commented that the issue for that person was that they didn't have enough reads that matched their cutoff lengths. Looking at my "Demultiplexed sequence length summary" output from CutAdapt (below) I don't think that's my issue. If I'm interpreting the table correctly, 98% of my trimmed reads are still 301 bases long, yes?
Assuming I trimmed correctly, I used this command to run DADA2: qiime dada2 denoise-paired --i-demultiplexed-seqs cutadapt_out/trimmed_sequences.qza --p-trunc-len-f 300 --p-trunc-len-r 290 --p-n-threads 9 --output-dir dada2_out_trimmed --verbose
The resulting denoising summary is awful, the highest percentage of retained filtered reads is only 0.51%.
Hi @saatkinson,
Have you tried playing around with the trunc parameters?
I suspect that your drop in quality around 280 might be causing this. It seems like you are losing a lot of your reads during the filtering step. This indicates to me that its most likely a quality issue.
I would recommend trimming your reverse reads at around 280 and let us know the results!
I reran DADA2 with the reverse cutoff of 280 and it only barely improved the percentage passing the filter (now 0.67% instead of 0.51%).
Should I keep playing with cutoff values? I'm not sure how to scientifically pick a value for the forward reads since they're all the whisker lines (no blue bars like the reverse reads have).
Yes, I used the same forward read cutoff (300) as my original run.
I've never used these parameters before and don't really understand what they're doing from the dada2 denoise-paired --help menu. Could you point me toward an explanation/instruction on how to adjust the max-ee-f/r values intelligently (for lack of a better phrase)?
Do you think this would be better than changing the --p-max-ee-f/r parameters?
It could be a reason! Please disable it by setting to 0 or try another cutoff that is smaller than 300, for example 290 or 295. BTW, did you remove primers?
It is a number of possible errors you are allowing in your reads. I guess that setting it to 300 will allow 300 errors... I would just play with it by gradually increasing these value to see if there is any effect.
Usually, I go for it if nothing works for dada2. I am not aware of any studies that compared both methods. But a couple of times, Deblur saved the data.
Yes, I used cutadapt to trim the reads and remove primers.
The max-ee infomration/NextSeq binning information was especially helpful! I think I'm starting to understand how this is working.
So setting max-ee to 3 means that the full read is removed if more than 1/2 (3/6) of the q-scores of that read are Q20?
I've run a few iterations to test. Increasing max-ee-f and -r to 3 alone wasn't enough to help, so I added in timanix's suggestion of changing the forward cutoff to 290. Combining the two ideas has gotten the percentage input passed filter up to single digit percentages (highest is 2.5%, which is a huge improvement from 0.51% )!
Here's my DADA2 command in case I'm doing something wrong:
If I set the max-ee to 4 will that allow Q-scores less than 20 to pass the filter? Or is it a strict 2/3 of q-scores that are Q20. I don't mind allowing Q20 to pass, but don't want to drop much farther than that quality-wise.
Edit: tried this and only increased the percent passing the filter by 1%.
Going forward, do you think it's wise to keep changing the max-ee parameter/forward-reverse cutoffs or is it time to follow timanix's other suggestion of using VSEARCH + deblur?
Expected Error is all about cumulative error across the full read. So higher EE means reads with more Q20 bases will be able to pass. You can calculate the exact number.
I usually run DADA2 >5 times with different settings to see what works best. Trying even shorter trimming to increase merging is a good idea. Deblur is another good option!
"yoqfo" you only quality filter once
(false, obviously. Discarding reads that can't join and removing singleton features also serve as quality filters.)
One more idea:
Looks like you trimmed the primers with cutadapt, but didn't discarded reads in which primers were not found. In that case, a lot of reads may be actually shorter that 290, and therefore discarded at the filtering step.
If I am right about it, then adding option to cutadapt "discard untrimmed" (check docs or help for right syntax) will allow you to plot quality scores with more precise length of the reads. Then you can reconsider truncation parameters.
Hope that it will add more percentage to the output!
So I would play with it a little more and then move to Deblur.
I didn't realize this was a thing (note to self, read more options on help pages)!
Running the cutadapt command with the discard-untrimmed flag and playing with the cutoffs of the resulting reads and using a max-ee value of three has seemed to do the trick! My percent reads passing filters are now at 72% on the low end! And I have 62.4 - 73.7% passing chimera checks.