I'm a new QIIME2 user and a first-time poster. I need some help understanding the DADA2 denoise function and what it's doing at each step.
I'm trying to extract ITS2 sequences from paired-end sequences for classification, but I'm running into issues at the denoise step. Some of my sequences are lost at the initial filter (up to 99%), some are lost at the merge step (up to 99% again), and other samples have ~60% of reads pass all the steps. The results are all over the place. So I think I really need to understand what DADA2 is doing at each of these steps. What are some common causes for a lot of sequences to be initially filtered out? What causes a huge loss of reads at the merge step? How can I change my parameters to get more reads to pass? And how do I know when not to adjust the parameters because they would let low-quality reads into my representative sequences?
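One way to see where each sample loses its reads is to express every step's loss as a fraction of the input. The following is a minimal Python sketch, assuming the per-sample counts have been copied out of the denoising stats table. The column names (`input`, `filtered`, `denoised`, `merged`, `non-chimeric`) are assumed to match that table, and the example counts are hypothetical, not taken from this dataset.

```python
def read_loss_by_step(stats):
    """Return, per sample, the fraction of input reads lost at each step."""
    steps = ["filtered", "denoised", "merged", "non-chimeric"]
    report = {}
    for sample, counts in stats.items():
        total = counts["input"]
        previous = total
        losses = {}
        for step in steps:
            # Reads lost at this step, as a fraction of the original input.
            losses[step] = (previous - counts[step]) / total if total else 0.0
            previous = counts[step]
        report[sample] = losses
    return report


# Hypothetical counts for illustration only.
example = {
    "sampleA": {"input": 10000, "filtered": 100, "denoised": 90,
                "merged": 80, "non-chimeric": 80},
    "sampleB": {"input": 10000, "filtered": 9000, "denoised": 8800,
                "merged": 100, "non-chimeric": 100},
}

for sample, losses in read_loss_by_step(example).items():
    worst = max(losses, key=losses.get)
    print(sample, "loses most reads at:", worst)
```

Grouping samples by their worst step makes it easier to check whether the filter-limited and merge-limited samples differ in some other way, such as expected amplicon length.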
I've run these commands so far, just trimming the primers and then checking the read quality before denoising. Because I'm working with ITS data, I'm not truncating at a specific position, as per this explanation.
I would try --p-trunc-len-f 260 and --p-trunc-len-r 190 to cut off the low-quality ends of the reads. This should hopefully improve the number of reads that pass the filter stage and allow more of them to be merged successfully.
I tried that and it helped with some samples, but I still have 13 out of 35 samples that lose between 50%-99.9% of reads at the initial filtering step and 8 samples that lose 99% of reads at the merge step. Oddly enough I'm not seeing any loss at the chimera step. I'm also apprehensive about using --p-trunc-len with ITS data as the DADA2 ITS Pipeline Workflow does not recommend truncating the reads at a specified position. ITS sequences have huge variances in length, so trimming at a specified position will cause the longer reads to be filtered out.
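For reference, here is a minimal Python sketch of what truncating at a fixed position does to a single read. It assumes DADA2's documented behaviour that reads shorter than the truncation length are discarded and that a truncation length of 0 disables truncation. The function name and the read lengths are hypothetical.

```python
def truncate_fixed(read, trunc_len):
    """Truncate a read to trunc_len bases.

    Returns None when the read is shorter than trunc_len (the read is
    discarded). trunc_len == 0 means no truncation is applied.
    """
    if trunc_len == 0:
        return read
    if len(read) < trunc_len:
        return None
    return read[:trunc_len]


# Hypothetical reads: one full-length, one shortened by primer trimming.
long_read = "A" * 300
short_read = "A" * 200

print(len(truncate_fixed(long_read, 260)))   # kept, cut to 260 bases
print(truncate_fixed(short_read, 260))       # discarded
```

Under this assumption, a read that primer trimming has already shortened (for example, one from a short ITS amplicon) would be dropped by the filter no matter how good its quality is, while a long amplicon would keep its reads but might no longer overlap after truncation.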
The very high variability of read loss at both the filtering stage and at the merging stage is not something that I've seen before, so I'm going to have to speculate a bit.
My first thought is that this is an ITS length-variability issue. Perhaps samples dominated by short amplicons are reading through into the opposite primer, adapter and beyond, resulting in very low-quality reads that are being removed by filtering. Meanwhile, other samples with long amplicons are failing to merge because the reads don't overlap.
This leads into another question from my end: is there a current recommended Q2 workflow for ITS amplicon data? In the DADA2 R space we have our ITS workflow, which uses cutadapt to remove primers and truncate reads prior to the main DADA2 workflow. We also often recommend that folks with intractable merging problems (which can arise if the amplified part of the ITS often exceeds the total length of the forward+reverse reads) consider using forward reads alone to avoid the merging issues. There was a fungi-ITS-specific paper that independently described this same R1-only approach.
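The two failure modes can be put into rough arithmetic. This is a minimal Python sketch, assuming 300-nt forward and reverse reads (a hypothetical length, not stated in this thread) and a minimum overlap of 12 nt, which is the default of `mergePairs` in DADA2 R. It ignores primers and truncation.

```python
def classify_pair(amplicon_len, fwd_len, rev_len, min_overlap=12):
    """Classify a read pair by how the amplicon length compares to the reads."""
    if amplicon_len < max(fwd_len, rev_len):
        # At least one read runs off the end of the amplicon into the
        # opposite primer and adapter.
        return "read-through"
    overlap = fwd_len + rev_len - amplicon_len
    if overlap < min_overlap:
        # The two reads do not share enough bases to be merged.
        return "cannot merge"
    return "mergeable"


# Hypothetical amplicon lengths with 2 x 300 nt reads.
for amplicon_len in (180, 450, 650):
    print(amplicon_len, classify_pair(amplicon_len, 300, 300))
```

If the amplicon length distribution differs between samples, this kind of check would predict which samples lose reads at filtering and which lose them at merging.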
All that said, is there any pattern you can see in the types of samples that are either being lost mostly at filtering, or lost mostly at merging?
So I tried @Mike_Stevenson's suggestion and it somewhat helped for the short reads, but the longer reads were still not passing the initial filter, so that seemed like a dead end. Instead, while trying different things, I increased --p-trunc-q to 20.
Somehow that helped, and now most samples have 60% of reads passing the initial filter, with a few of the samples having just 20% pass. It's an improvement, but I don't quite understand why that helped. And for 18 out of 35 of the samples, I'm still losing a good chunk at the merging step.
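One possible explanation, sketched in Python: DADA2 truncates each read at the first base whose quality score is less than or equal to the trunc-q value, and only then applies the expected-error filter (a maximum of 2.0 expected errors by default in QIIME 2). Cutting off a low-quality tail lowers the expected errors, so more reads pass the filter, but the surviving reads are shorter and may no longer overlap. The quality profile below is hypothetical.

```python
def truncate_at_quality(quals, trunc_q):
    """Cut the read at the first base with quality <= trunc_q."""
    for i, q in enumerate(quals):
        if q <= trunc_q:
            return quals[:i]
    return quals


def expected_errors(quals):
    """Sum of per-base error probabilities, 10^(-Q/10) for each Phred score Q."""
    return sum(10 ** (-q / 10) for q in quals)


# Hypothetical read: 150 good bases followed by a 50-base low-quality tail.
quals = [35] * 150 + [12] * 50
max_ee = 2.0

for trunc_q in (2, 20):
    kept = truncate_at_quality(quals, trunc_q)
    ee = expected_errors(kept)
    print(f"trunc_q={trunc_q}: length={len(kept)}, "
          f"expected errors={ee:.2f}, passes filter={ee <= max_ee}")
```

In this toy case the read fails the filter at a trunc-q of 2 and passes at 20, at the cost of 50 bases. That would be consistent with more reads passing the initial filter while a good chunk is still lost at merging.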
To answer you @benjjneb, I can't see any sort of pattern in the samples. I know that in the past when I would trim/truncate ITS data I would lose all of the long reads, due to the region's high variability. That's not what's happening here though. I do like your suggestion of using forward reads only. I'll read the article first before I commit to it.
I'm in the process of running the data through DADA2 in R. I'm hoping that the additional functionality of DADA2 R will allow me to isolate the problem. I'll keep everyone updated on how that goes.