DADA2 failed to merge paires

Yoha · July 30, 2021, 8:26pm

Hello, I am a new dada2 user, and on the MiSeq platform, we sequenced the bacterial 16S rRNA coding gene in paired end mode (2 X 300 bp) (Illumina). The sequencing platform had already removed the primers and barcodes. There is no information on the primers that were utilized.
I started processing it with dada2, and after 220 bp and 120 bp for forward and reverse, respectively, the quality profile started to drop. I began by setting the truncL(220, 120) trimleft (10,10). 70-90 percent of readings were retained after the filtering phase, but no matched reads were combined. After that, a variety of truncL settings were tried, but none of them produced paired reads.
There was no merging regardless of different truncL and trimleft parameters used.
When truncL(0,0), trimleft(0,0), trunQ=2, maxEE(2,2) were applied, however, only 10 - 16 percent of reads were maintained after filtering, and roughly 80 - 90 percent of the reads were merged. After that, we experimented with loosening the quality criteria. First, truncQ=3 with null values for truncL and trimleft resulted in the same percent of reads kept after filtering (10 -16%) as truncQ=2 with null values for truncL and trimlef.
After then, 80-90 percent of the reads were combined. This demonstrates to me that no read is being deleted due to a poor local quality score??? Then we began to increase the maxEE (up to 5,5), which resulted in roughly 50 reads being kept after the filtering stage, and about 80 – 90% of reads being merged.
I'm puzzled as to what I should do with the reads for downstream processing??
What would you recommend in this case, and how can I get a high percentage of reads combined while deleting the low-quality ones?
I'm considering solely using the forward primer, however this will reduce the sequence resolution.
I tried pre-merging a few samples with pandaseq and processing the merged reads as a single read, however pandaseq's output is fasta rather than fastq. As a result, I am unable to plot the quality score and I am not sure about the following steps.
As an example, I've included a fastq file (forward and reverse read) of one sample (link provided below)
https://drive.google.com/drive/folders/1weRunOS0PJbO8IVMnkmOaZufB4uwyVBf?usp=sharing

colinbrislawn · August 3, 2021, 1:51am

Hello Yoha,

Thank you for giving us lots of details about your data set and what you have tried. That's very helpful.

When merging paired end reads, I want as many of my reads to join as possible and try to find the settings that do this best. When I'm quality filtering my reads, I'm making a trade off between keeping more data or allowing lower quality data into my analysis. There is no perfect answer to this question.

Would you be willing to post the interactive quality plot that is the output of qiime demux summarize? I'm puzzled too, and I think seeing the read quality would help us understand how length trimming and EE filtering is changing the number of reads that merge and pass filter.

You could post a screenshot, or a full .qzv file. It should look something like this:

Yoha · August 3, 2021, 9:47am

Hello Colin,

Thank you for taking the time to answer to my query.
here is a link to an interactive quality plot of my samples (I'm working on a subsample of five to save processing time).
https://view.qiime2.org/visualization/?src=ce4cb6f8-44ab-4e77-8ef2-6e9d0fd672c2&type=html
Finally, I have some further information about the primer.
We chose 351F and 926R, which means the amplicon size is 926 - 351 = 575. Since the sequencing was done in 2 x 300 bp, the overlapping region is computed as follows:
The overlap is (300+300) - 575 = 25 bp.
Thus, the overlap is only 25 bp.

Yoha · August 3, 2021, 10:12am

The above link does not work. Below is the demux qzv: -
demux.qzv (314.1 KB)

colinbrislawn · August 3, 2021, 2:34pm

Hello Yoha,

Thanks for posting the qzv file! I was able to open it up and take a look at the quality. With the knowledge of the quality, and the region sequenced, I think we can solve the mystery.

Your math is correct. Which is why the quality plots are so disappointing...

I've labeled that 25 bp region of overlap in this plot below.

As you can see, the quality in that final region is quite low, especially in the reverse reads. This is why they cannot join.

I think this is your best option. The quality of R1 is quite good, and while it's shorter than the full 575 region amplified, there is no good way to join these reads and capture the full region.

This goes back to the trade off between a small amount of good data or a large amount of bad data. While only using the forward read does reduce resolution, the quality is much higher and this will let you keep more of your data after filtering.

Try using just your forward read with dada2 denoise-single and let me know what you find!

Colin

Yoha · August 3, 2021, 6:38pm

Hello Colin,

I understand, and I believe it is a good idea to focus just on the forward read. Thank you for devoting the time and effort to see this through to the end.

system · September 4, 2021, 12:38am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.