Question about dada2 denoise-paired analysis

agan · March 24, 2017, 4:19pm

Dear Qiime2 Masters:
I analyzed our 515-806 sequence data using both single-end and paired-end read analyses of qiime2-2017.2.
At the demux steps, both analyses generated the same demux.qzv file, and the same demux quality plots for read1.
After the denoise-single (read1: trim-left 5, trunc-len 250) and denoise-paired (read1: trim-left 5, trunc-len 250; read2: trim-left 5, trunc-len 150) trim/trunc and following visualization analyses, I notice that most of the OTUs generated in the rep-seqs.qzv of denoise-paired analysis are only 245 bp. And the several merged sequences are mainly eukaryotic sequences. The number of features (OTUs) decreases from 750 (single) to 566 (paired) as shown in the table.qzv files of the two analyses.
I am wondering if the low merge ratio is caused by the quality of our sequence data, or we should only use the single-end analysis in current version.
Pleases find attached the pertinent files:

demux-single.qzv (164.7 KB)
demux-paired.qzv (164.5 KB)
rep-seqs-single.qzv (251.2 KB)
rep-seqs-paired.qzv (224.7 KB)
table250.qzv (298.9 KB)
table-paired.qzv (296.5 KB)
demux-qual-plots-paired.qzv (2.9 MB)

Thanks a lot for your help!

Ganyu

benjjneb · March 27, 2017, 3:32pm

Hi Ganyu,
Given the trimming parameters that you are using, the reverse reads are (in almost all cases) completely overlapping the forward reads. Thus, the merged reads simply are the forward reads, as the reverse read has added no new nucleotides. The 245nt sequences you are seeing in the merged output are merged, and the few that are longer than 245 are instances when gene regions much longer than the expected target were amplified by these primers, which is often associated with eukaryotic sequences as they are highly evolutionarily divergent.

When the F/R reads are completely overlapping, the number of merged features will necessarily be less than the F reads alone, as dada2 will require the inferred F/R sequences to exactly match in the overlap region. This is due to culling out some residual errors, but also due to losing some (almost always rare) real variants that were detected in the F reads but not in the R reads because of the lower quality of the R reads.

While choosing to use F reads alone is a valid approach, I recommend using the R reads as well as long as the quality of the R reads is not too low. But when there is complete overlap, the choice boils down to minimizing the FP rate vs. being a bit more sensitive to rare variants.

-Ben

agan · March 27, 2017, 10:45pm

Thanks a lot for your reply, Ben.
I am still confused about the total length of the amplicons. Since we used the 515F and 806R primers, I think the total length of our sequences (read1 + read2) should be about 806-515+1=292 bp. So after merging read1 and read 2, I thought about 45 new nucleotides (after paired-end analysis) would be added to the features (OTUs generated by single-end analysis)…
Thanks again for your help!

Ganyu

benjjneb · March 28, 2017, 3:23am

515 and 806 are the positions at which the primers start. Both primers are about 20nts in length (17/21 maybe?), so chop about 40nts of primer sequences off your calculated length, and the length of the amplicon sequence between the ends of the 515F/806r primers is about 250 nts (just over I think, maybe 253 on average).

So if you trim nothing, you will get amplicons in tight length distribution around ~253 nts.

agan · March 28, 2017, 12:47pm

Thanks a lot for the explicit explanation!
We will try to use the 515f/926r primers to obtain longer sequences. Thanks.

Ganyu