I've recently started using QIIME2. I'd just like a reality check for myself regarding the denoising step in DADA2. In total I have 96 samples and overall I think the quality of the reads looks pretty good. They are 300 bp PE reads and we have amplified the V3-V4 region. Please see below.
I am just playing around with the truncation and trimming options in DADA2 and am trying to figure out the optimum settings. I've read a few other threads on the forum and, based on these, have tried various truncation lengths and trims and have viewed (in a very non-scientific way) the impact these have on the number of reads that pass the initial filter and are subsequently labelled as Features.
I initially set a truncation length of 270 for both forward and reverse reads.
I have played around with alternative truncation lengths, but in general I think the best results I see are when I don't truncate at all. Again, I had read some posts recommending that a position be retained as long as its median Phred score was >30.
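To make that rule of thumb concrete, here is a small sketch (my own hypothetical helper, not a QIIME 2 function) that picks a truncation length as the last position before the per-position median Phred score drops below a cutoff:

```python
# Hypothetical helper: choose a truncation length from a per-position
# median quality profile, keeping positions while the median Phred >= cutoff.

def pick_trunc_len(median_quals, min_median=30):
    """median_quals: list of median Phred scores; index 0 is position 1."""
    trunc = 0
    for pos, q in enumerate(median_quals, start=1):
        if q >= min_median:
            trunc = pos
        else:
            break  # stop at the first position failing the cutoff
    return trunc

# Toy profile: quality stays high, then tails off at the 3' end
quals = [38] * 250 + [34] * 20 + [28] * 30
print(pick_trunc_len(quals))  # -> 270
```

In practice you would read these medians off the interactive quality plot from `qiime demux summarize` rather than compute them yourself.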
I know this is an incredibly general and subjective question, but based on the number of reads that passed and the resultant Feature table generated, would I be right in assuming that I can proceed with the denoising run that does not truncate reads? I have also played around with overlap values and am not seeing major differences.
I guess the thing that has surprised me is that we also have ITS data for these samples, and the number of reads that pass the DADA2 filter and result in Features is much higher than what we observe for the 16S data (see below). We used ITSxpress to trim reads before passing them into DADA2. I guess this is just a difference in the length of the amplicons?
Thanks for the quick reply, much appreciated. I just wanted to make sure I wasn't doing something unusual before proceeding. I probably will at this point, but I am kind of surprised that such a high proportion of the reads can't be merged, as I assumed the V3-V4 region to be ~460 nucleotides. After trimming I think the majority of our reads are ~280 nt, so I calculated a rough overlap of 280F + 280R - 460 (amplicon) ≈ 100 nucleotides.
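That back-of-the-envelope calculation is just: expected overlap = forward read length + reverse read length - amplicon length. As a quick sketch:

```python
# Rough merge-overlap estimate: how many bases the forward and reverse
# reads should share, given the amplicon length.

def expected_overlap(fwd_len, rev_len, amplicon_len):
    return fwd_len + rev_len - amplicon_len

print(expected_overlap(280, 280, 460))  # -> 100
```

Note this is an upper bound for the longest amplicons in the run; V3-V4 amplicon lengths vary by taxon, so some reads will have less overlap than this.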
I think what is confusing the issue is that the sequencing supplier provided Tags along with the Raw and Clean reads. The number of Tags per sample is very high (60K+) and the average length of these Tags is ~416 nt. They stated that if paired-end reads overlap with each other, a consensus sequence is generated using Fast Length Adjustment of SHort reads (FLASH v1.2.11), with a minimum overlap length of 15 bp and a mismatch ratio of the overlapped region ≤ 0.1.
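For intuition, the stated FLASH criteria amount to a check like the following sketch (my own simplification, not FLASH's actual algorithm, which also scores candidate overlap lengths):

```python
# Simplified FLASH-style merge check: given the forward read and the
# reverse-complemented reverse read, accept a candidate overlap if it is
# at least min_overlap long and its mismatch ratio is <= max_mismatch_ratio.

def flash_like_mergeable(fwd, rev_rc, overlap_len,
                         min_overlap=15, max_mismatch_ratio=0.1):
    if overlap_len < min_overlap:
        return False
    a = fwd[-overlap_len:]   # 3' end of the forward read
    b = rev_rc[:overlap_len]  # 5' end of the reverse-complemented read
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / overlap_len <= max_mismatch_ratio

fwd = "A" * 50 + "ACGTACGTACGTACGTACGT"
rev_rc = "ACGTACGTACGTACGTACGT" + "C" * 50
print(flash_like_mergeable(fwd, rev_rc, 20))  # -> True
```

The ~416 nt average Tag length is a useful clue: it suggests the real merged amplicon is noticeably shorter than the ~460 nt you assumed.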
I was thinking of importing these Tags as SE reads and then running them through DADA2 to compare the number of Features at the end. Not sure if this is a sensible thing to do?
Thanks for this suggestion. Have to say this one has me scratching my head; I assumed you could never have too much overlap. In an ideal world, if one has paired-end reads with every position receiving a high Phred score (>35, for example), I'm guessing this wouldn't be an issue?
I followed your suggestion and tried an array of different truncation lengths (I won't bore you with all of them). As a rough rule of thumb, I kept an eye on the samples with the largest and smallest numbers of features (CF12 and K17, respectively), as well as the overall number of features merged across the 96 samples.
Following your suggestion of --p-trunc-len-f 260 --p-trunc-len-r 220:
CF12 = 35,206, K17 = 12,105, overall 1,931,093, so a large increase in the CF12 sample. stats-dada2_260_200.qzv (1.2 MB)
I also played around a little with --p-trunc-len-f 270 --p-trunc-len-r 180:
CF12 = 39,023, K17 = 13,926, overall 2,272,248. stats-dada2_270_180.qzv (1.2 MB)
So the take-home message: truncating the reverse reads had the biggest impact. Thanks again for the suggestion; I'm happy to proceed with these denoised reads at this point, as I think they have improved on my initial merging strategy.
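As a sanity check on those two settings, here is the overlap left after truncation, assuming the ~416 nt average Tag length reported by the supplier approximates the true amplicon length (an assumption on my part), and using DADA2's default minimum merge overlap of 12 nt:

```python
# Overlap remaining after truncation, assuming an ~416 nt amplicon
# (from the supplier's FLASH tags). DADA2 requires at least 12 nt of
# overlap to merge read pairs by default.

AMPLICON = 416   # assumed amplicon length, from the average Tag length
MIN_OVERLAP = 12  # DADA2 default minimum overlap for merging

for fwd, rev in [(260, 220), (270, 180)]:
    overlap = fwd + rev - AMPLICON
    status = "OK" if overlap >= MIN_OVERLAP else "too short to merge"
    print(f"trunc {fwd}/{rev}: ~{overlap} nt overlap, {status}")
```

Both settings leave comfortably more than 12 nt of overlap under this assumption, while trimming away the low-quality 3' tails that cause filtering and merging failures.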
For the DADA2 method, truncation is done first, then reads are removed if their cumulative expected error is too high. This means shorter reads accumulate fewer expected errors, so more of them pass the filter.
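The expected error (EE) of a read is the sum of its per-base error probabilities, where a Phred score Q implies an error probability of 10^(-Q/10); DADA2's max-EE filter (default 2.0 in `qiime dada2 denoise-paired`) discards reads whose EE exceeds the threshold. A quick illustration of why truncation helps:

```python
# Expected error (EE) of a read: sum of per-base error probabilities,
# P(error) = 10 ** (-Q / 10) for a Phred score Q.

def expected_errors(quals):
    return sum(10 ** (-q / 10) for q in quals)

full = [30] * 250 + [20] * 50  # a 300 nt read with a noisy 3' tail
truncated = full[:250]          # the same read truncated at position 250

print(round(expected_errors(full), 2))       # -> 0.75
print(round(expected_errors(truncated), 2))  # -> 0.25
```

The 50 Q20 bases at the tail contribute two-thirds of the read's expected errors, so cutting them off makes many more reads fall under the max-EE threshold.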
This is counterintuitive. In other programs, more overlap is better. But not for DADA2.
Great! This is expected.
Feel free to open another thread if you have more QIIME 2 questions.