I’m working with some subpar reads and I’m struggling to justify my trimming locations and quality score cutoffs.
Data background: 16S data from a MiSeq run; the paired-end reads should cover the V3-V4 region using the 341F/805R primers recommended by Illumina. This supposedly gives a ~460 bp amplicon with a 140 bp overlap. I imported the demultiplexed fastq files into QIIME 2 in PairedEndFastqManifestPhred33 format and created a demux summarize .qzv artifact (Dropbox link here).
My initial thought was to use a minimum quality score of 20 as the cutoff, judged by the median score at each position. This puts the truncation point of the forward reads at ~268 and of the reverse reads at ~221. If my math is correct (268 + 221 - 460), this still leaves a ~29 bp overlap, and I believe DADA2 recommends a minimum 20 bp overlap for merging.
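For reference, the overlap arithmetic can be sketched in a few lines of Python. This is only a rough check: it takes the ~460 bp amplicon length at face value, assumes 2x300 reads (which is what the quoted 140 bp overlap implies), and the function name is made up for illustration. The result shifts by a base or two depending on the exact amplicon length assumed.

```python
def overlap_after_truncation(amplicon_len: int, trunc_len_f: int, trunc_len_r: int) -> int:
    """Bases shared by the forward and reverse reads after truncation.

    The forward read covers amplicon positions [0, trunc_len_f) and the
    reverse read covers [amplicon_len - trunc_len_r, amplicon_len), so the
    shared stretch is trunc_len_f + trunc_len_r - amplicon_len.
    A negative value means the reads no longer meet.
    """
    return trunc_len_f + trunc_len_r - amplicon_len


AMPLICON_LEN = 460  # approximate V3-V4 amplicon length quoted above

# Untruncated reads, assuming a 2x300 run
print(overlap_after_truncation(AMPLICON_LEN, 300, 300))  # 140

# Truncation points suggested by the Q20 cutoff
print(overlap_after_truncation(AMPLICON_LEN, 268, 221))  # 29
```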
So my question then is: would I be better off keeping the above parameters, even though they are not great to begin with, or lowering my quality score cutoff to, say, 15 and retaining longer sequences? I’m wondering which of the two factors is more important for denoising, or whether there is a sweet spot that balances them. Finally, at what point does read quality get low enough that using just the forward reads would perform better than forcing low-quality paired-end reads to merge?
Thank you in advance and I’m really enjoying the dada2 integration in qiime2.
Thanks for the link! Your sequence quality scores definitely start tanking after ~100 bp in both directions, which is a bummer.
I would be as conservative as possible while still maintaining overlap. The quality scores at the ends of each direction impact the results of denoising. The noisier your data is, the harder it is to build an accurate error model, and the more likely it is that low-abundance variation will be mistaken for error or that your reads will fail to merge. (@benjjneb, please correct me if I’m wrong.)
As an aside, I would also consider setting trim_left for the forward reads, as you have a pretty dramatic dip in quality before position ~20.
That’s a great question which doesn’t have a great answer. I would encourage you to run it both ways and see which results in a more reasonable frequency of features. Falling back to forward reads is actually a pretty common situation, so you can just pass your paired-end data into denoise-single without changing anything.
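Running it both ways could look roughly like the sketch below, which only assembles and prints the two command lines rather than executing them. The file names are placeholders, the trim and truncation values are the ones discussed in this thread, and the option names follow recent QIIME 2 releases; they may differ in your version, so check `qiime dada2 denoise-paired --help` and `qiime dada2 denoise-single --help` before running anything.

```python
import shlex

# Placeholder artifact name, not taken from the thread.
DEMUX = "demux-paired-end.qza"

TRIM_LEFT_F = 20          # skip the quality dip near the start of the forward reads
TRUNC_F, TRUNC_R = 268, 221  # truncation points from the Q20 cutoff

# Paired-end denoising: both reads are truncated, then merged.
paired_cmd = [
    "qiime", "dada2", "denoise-paired",
    "--i-demultiplexed-seqs", DEMUX,
    "--p-trim-left-f", str(TRIM_LEFT_F),
    "--p-trim-left-r", "0",
    "--p-trunc-len-f", str(TRUNC_F),
    "--p-trunc-len-r", str(TRUNC_R),
    "--o-table", "table-paired.qza",
    "--o-representative-sequences", "rep-seqs-paired.qza",
    "--o-denoising-stats", "stats-paired.qza",
]

# Forward-only denoising on the same paired-end artifact.
single_cmd = [
    "qiime", "dada2", "denoise-single",
    "--i-demultiplexed-seqs", DEMUX,
    "--p-trim-left", str(TRIM_LEFT_F),
    "--p-trunc-len", str(TRUNC_F),
    "--o-table", "table-single.qza",
    "--o-representative-sequences", "rep-seqs-single.qza",
    "--o-denoising-stats", "stats-single.qza",
]

# Print shell-ready commands for review; nothing is executed here.
for cmd in (paired_cmd, single_cmd):
    print(shlex.join(cmd))
```

Comparing the two resulting feature tables (for example with feature-table summarize) should show which run retains a more reasonable number of reads and features.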
Thank you! I thought that might be the case: it comes down to familiarizing yourself with the data and tailoring the parameters to the original question. Seems like a common theme around here :P And yes, I certainly noticed that initial dip on the forward reads too, and I think I’ll trim that as well. I just didn’t want to distract from the main concern of how much of the overlapping region to retain. I’ll play around with the different options.
To add one thing to what @ebolyen said: the key thing with the amount of overlap to keep is that it must be at least 20 nts + the natural length variation of the amplicon. (Warning: made-up numbers ahead.) For example, if the average V3-V4 amplicon is 460 nts but can naturally be as long as 470 nts, then you need 20 nts + 10 nts of overlap (in the average case) to maintain 20 nts of overlap in the case of the longest natural amplicons, since a longer amplicon leaves less overlap at fixed truncation lengths.
As long as you have that minimum amount of overlap, it’s usually better to truncate off sequence tails after the quality has crashed (if it does).
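As a rough sketch of that rule of thumb, the check below uses the same made-up numbers (460 nts average, 10 nts of length variation) together with the 268/221 truncation points from earlier in the thread. The function names are invented for illustration, and the real length variation should come from your own data.

```python
MIN_MERGE_OVERLAP = 20  # minimum overlap mentioned in this thread


def required_overlap(length_variation: int, min_overlap: int = MIN_MERGE_OVERLAP) -> int:
    """Overlap needed for an average-length amplicon: minimum + length variation."""
    return min_overlap + length_variation


def min_total_truncation(avg_amplicon_len: int, length_variation: int,
                         min_overlap: int = MIN_MERGE_OVERLAP) -> int:
    """Smallest trunc_len_f + trunc_len_r that still leaves the required overlap."""
    return avg_amplicon_len + required_overlap(length_variation, min_overlap)


# Made-up numbers from the example above, not measured values.
avg_len, variation = 460, 10

needed = min_total_truncation(avg_len, variation)
kept = 268 + 221  # forward + reverse truncation lengths discussed earlier

print(needed, kept, kept >= needed)  # 490 489 False
```

With these illustrative numbers the 268/221 choice lands one base short of the target, so keeping a few more bases on whichever read holds its quality longer may be worth testing.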