Really quick observation, and maybe this has been addressed. I’ve been troubleshooting a lot of DADA2 issues for the lab. Does anyone have a quick place/reference for where sequences are lost in the DADA2 denoising process?
For example:
This was an issue at the chimera-filtering step, but for another run I’m losing sequences between the filter and denoise steps, and I’m having a hard time looking up the troubleshooting. I figured out the filter step: it seems the issue is due to poor overlap on the reads (the reverse reads for my run have poor quality). Does anyone have a primer, or know where to troubleshoot the DADA2 steps? Thank you in advance.
Hi @ben,
Definitely not a one-stop shop. There are many steps, each with its own troubleshooting peculiarities. We should really compile a single forum post to help navigate the process!
Oh that’s a tough one — I am not sure we have had to troubleshoot that step before, since usually filtering or merging is where the big issues happen. Maybe you could upload the qzv of your dada2 stats?
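If it helps, getting the stats into a viewable form is a one-liner; a minimal sketch, assuming your denoising-stats artifact is called stats.qza (the file names here are placeholders):

    # Turn the DADA2 denoising-stats artifact into a viewable .qzv
    # (stats.qza / stats.qzv are placeholder names)
    qiime metadata tabulate \
      --m-input-file stats.qza \
      --o-visualization stats.qzv

That .qzv shows, per sample, how many reads survived input -> filtered -> denoised -> merged -> non-chimeric, which is exactly what we need to see.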
Filtering has nothing to do with overlap; at that stage the reads are still being filtered independently. Rather, you were not truncating enough on the reverse read (though truncate more and then you run into overlap issues at merging).
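To make that trade-off concrete, these are the knobs in question on the QIIME 2 side; a sketch only, with placeholder file names and truncation values, not a recommendation for your data:

    # Sketch: demux.qza and the trunc values are placeholders.
    # Shorter --p-trunc-len-r drops the low-quality reverse tails
    # (helps the filter step) but eats into the overlap that the
    # merge step needs, hence the balancing act.
    qiime dada2 denoise-paired \
      --i-demultiplexed-seqs demux.qza \
      --p-trunc-len-f 280 \
      --p-trunc-len-r 230 \
      --o-table table.qza \
      --o-representative-sequences rep-seqs.qza \
      --o-denoising-stats stats.qza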
No, unfortunately. I do not think even the DADA2 R tutorial has info like this, though the FAQ covers some of it.
Let’s tackle problem 1 first: your denoising loss issue, then maybe you or someone else could give us a hand to tackle problem 2: putting together a step-by-step dada2 troubleshooting guide!
Could the mods change the title? I don’t think I want to start another thread, but I’ve been having some trouble with sequence loss in DADA2 with Nextera Illumina sequences.
Is this normal sequence loss across the run? Here’s what I have tried:
Shortened the forward and reverse trunc lengths, which helped with the loss between input->filtered, but this strategy still had me losing a lot of sequences at the denoising step.
Is this simply an issue with the run itself? Too noisy?
Instead, can I just grab the V4 region and use that for DADA2?
Are there any parts of the Nextera primer/linker/indexes left in the rep-seqs?
I have already tried q2-cutadapt to see if there are adapters left, but it finished running and reported hits in maybe 0.1% of sequences in R1 or R3, depending on how we oriented cutadapt.
Any other troubleshooting thoughts? To be honest, despite the loss of sequences, the data from the table actually look like what we predicted, so I am assuming the sequences lost along the way are actually trash sequences; with DADA2 doing its job, that is to be expected.
Thanks for including view links! You are saving my fingers precious keystrokes
Seems like the issue is you are including some low-quality segments with your truncation parameters, leading to high loss at both filtering and denoising.
That was going to be my first suggestion. 270 is too long on the reverse; you are moving into dangerously low-quality territory. 270 is fine on the forward (I’d even push it to 290 if you need it for merging!), but maybe do 250 on the reverse, or even 220 if you can afford it?
Not a beautiful run, and even less so on the reverse. Truncate as much as you can afford.
I’m assuming this would be the equivalent of just using the forward read. That’s fine if you need to, and may be worth comparing the results, but I think you can try to juice out more merged reads.
No clue! You could use cutadapt to test this (trim the primer and see what comes off)
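Something along these lines with q2-cutadapt would do as a test; the primer sequences shown are the standard 341F/805R V3V4 pair and are an assumption on my part, so substitute whatever your protocol actually used:

    # Sketch: primer sequences are assumed (standard V3V4 341F/805R);
    # swap in your actual primers. --verbose prints cutadapt's report
    # of how many reads contained each primer, i.e., what "came off".
    qiime cutadapt trim-paired \
      --i-demultiplexed-sequences demux.qza \
      --p-front-f CCTACGGGNGGCWGCAG \
      --p-front-r GACTACHVGGGTATCTAATCC \
      --o-trimmed-sequences trimmed.qza \
      --verbose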
Guess that answers that question… try with just the primers.
Yep, it is doing its job, but you are making that job more difficult by truncating the reverse reads too little. Please troubleshoot the following and let us know what you find:
cutadapt trim primers. Anything come off?
Adjust truncation parameters on the reverse. 250? 220? How low can you go!?!? Push the forward to 290 if you need more slack (see the sketch below).
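If you want to be systematic about the "how low can you go" question, a quick-and-dirty sweep over a few reverse truncation lengths is one option; a sketch, with placeholder file names:

    # Sketch: run denoise-paired at a few reverse trunc lengths and
    # keep the stats from each run so they can be compared side by side.
    for R in 250 235 220; do
      qiime dada2 denoise-paired \
        --i-demultiplexed-seqs demux.qza \
        --p-trunc-len-f 290 \
        --p-trunc-len-r $R \
        --o-table table-r$R.qza \
        --o-representative-sequences rep-seqs-r$R.qza \
        --o-denoising-stats stats-r$R.qza
    done

Then tabulate each stats artifact as above and compare where the reads drop out at each setting.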
Thanks, I actually did a really dramatic trunc where I think the retained V3V4 region was only 460 bp (I think I did the math for the minimal overlap: trunc 230 or so, compared to trunc 270). I will post those results (they are sitting on the work NAS, so I can’t get to them at this time).
Spoiler: I lose very little, maybe 10-20%, over input -> filter. But my losses across the rest of the steps were the same, and I ended up with almost the same number of sequences at the non-chimeric step. It goes without saying that when you use a more stringent trunc of 230 (both forward and reverse), it actually puts in slightly fewer reads.
I actually went slightly nutty and ran cutadapt against both the primers and adapters at multiple steps. I could not find anything left. I am assuming, since it was a Nextera Illumina run, that the sequences were sequenced downstream of the paired adapters.
So, to troubleshoot:
I will trunc the forward at 290 and the reverse at 230-ish, since I will be merging
I will cutadapt some more sections to see what I can find
But are the reads being lost at denoising or merging? At merging it definitely makes sense; like I’ve said, it’s a balancing act. But truncating more should not cause more loss at denoising (I think).
Sounds that way! So that’s good news.
Sounds good!
Probably non-essential; I’d recommend maybe just processing the forward reads as single-end, as an alternative test, to see whether the lower-quality reverse reads are the problem.
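As a sketch of that test, with placeholder names; I believe denoise-single will accept a paired-end artifact and just use the forward reads, but double-check that on your QIIME 2 version:

    # Sketch: denoise only the forward reads as single-end.
    # (denoise-single should accept the paired-end demux artifact and
    # use just the forward reads; verify on your version.)
    qiime dada2 denoise-single \
      --i-demultiplexed-seqs demux.qza \
      --p-trunc-len 290 \
      --o-table table-fwd.qza \
      --o-representative-sequences rep-seqs-fwd.qza \
      --o-denoising-stats stats-fwd.qza

If the losses at denoising largely disappear in the forward-only run, that points at the reverse reads.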
It looks like moving the overlap around and trying to get rid of the noisy, poor-quality reverse reads did nothing for the loss of sequences. The taxonomic assignments were just as expected, so there's that.
Weird. Honestly, I am not sure what it means when reads are lost at the denoising stage vs. the filtering stage; this does not appear to be documented anywhere and is not obvious to me.
Another good reason why there should be a quick reference guide for troubleshooting. Maybe that is something you would be interested in contributing?
For the purposes of your run, though, it may be time to just trust in dada2, unless you can think of something that could cause this issue. E.g., do you happen to have barcodes still in the reads? Perhaps singletons are dropped at this step if the most probable correct sequence cannot be found?
Thanks, I agree. I think what I’m ending up with is actually good sequences, so while there is significant loss, I do not think I’m actually losing anything valuable. I reviewed the run with the group we received the sequences from, and as far as I am concerned the predicted taxa are in the taxa bar plot and consistent with their hypothesis.
I would like to work on that, but I need to learn more about DADA2 first!
@ben,
I consulted with @benjjneb, who advised me that read loss at the denoising step can come from artifactual sources, e.g., low quality libraries and off-target amplification.
This may make a lot of sense for your samples (lung if I am not mistaken? or is this a different sample type?). If this is the case, this read loss is desirable!
So the good news is that you still get adequate read yields for most samples.
Thanks, I totally agree. We are slowly switching some of our pipeline over to DADA2, and I am sure that @benjjneb will be someone I will have to consult. Thanks for the insight; these were actually sputum samples, but I think they were processed outside of our control.
For our lung samples (BAL/lung tissue), we actually have very good recovery through the 16S V4 amplification. I believe that the quality of the V3V4 run may be the issue. I will discuss with the lab where these samples came from. Ben
edit: I should mention that these were Nextera primers; what was special about them is that they were customized for their run (a group studying sputum in lung disease). They had a couple of linkers with built-in “extensions”: the linkers had spaces for increasing numbers of random bases (N -> NN -> NNN -> NNNN), so they were slightly modified. I wonder if that’s what I’m seeing.