quality plot for merged reads/details for joining paired reads

Qingqi · August 12, 2020, 5:59pm

Hello,
I am confused about how QIIME 2 merges paired reads.
PS: the sequences are from the 16S V1-V3 region.
Here is the quality initially imported, DID_import.qzv (293.8 KB)
Then I merged the paired reads by
qiime vsearch join-pairs --i-demultiplexed-seqs import.qza --o-joined-sequences join.qza
Here is the quality plot for merged reads DID_join.qzv (297.6 KB)
Judging from the quality of the imported reads, I expected the merged reads to have lower quality in the middle, since the middle is where the ends of R1 and R2 meet. But the plot shows something totally different. I have also attached the quality plot after running quality control on the merged reads (command: qiime quality-filter q-score-joined). It does not change much compared with the joined reads. DID_filter.qzv (301.4 KB)
Can someone give me a hint about this? I would really appreciate it~

SoilRotifer · August 12, 2020, 7:33pm

Hi @Qingqi,

The quality of the imported reads is really poor. I would suspect something went wrong with the preparation of the samples prior to sequencing, or something went awry with the sequencing run.

When merging reads with tools like vsearch, the quality scores often increase, not decrease, in the region of overlap. The reason is that if the forward and reverse reads call the same base at the same position (even if both calls are low quality), the quality estimate for that base goes up: you have two independent observations of the base at that position. This is often the benefit of merging paired reads, as you can recover or increase your confidence in the sequence in the region of overlap.
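As a rough illustration of why agreement raises the score, here is a minimal Python sketch of the underlying Bayesian arithmetic. It assumes the two base calls are independent and that an erroneous call is equally likely to be any of the other three bases. The function names are made up for this example, and vsearch's actual implementation (including any cap on the maximum output score) may differ.

```python
import math

def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Convert an error probability back to a Phred quality score."""
    return -10 * math.log10(p)

def merged_quality_on_agreement(q_fwd, q_rev):
    """Posterior quality when both reads call the SAME base.

    Assumption (not taken from the vsearch source): independent errors,
    and a wrong call is uniformly one of the three other bases.
    """
    px = phred_to_error(q_fwd)
    py = phred_to_error(q_rev)
    # Probability that the agreed base is still wrong, by Bayes' rule.
    p_merged = (px * py / 3) / (1 - px - py + 4 * px * py / 3)
    return error_to_phred(p_merged)

# Two agreeing calls of equal quality give a higher merged quality.
for q in (10, 20, 30):
    print(q, round(merged_quality_on_agreement(q, q), 1))
```

Under these assumptions, even two agreeing Q10 calls come out well above Q10, which is consistent with the middle of the joined reads looking better than the ends of R1 and R2.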

-Mike

Qingqi · August 12, 2020, 9:00pm

Hi, Mike
Thanks for your reply, it was a great help.
The whole sample set was sequenced more than once, and the sequence quality of the same sample varies a lot between runs. I pooled the runs together, so the average quality per position is poor.
Based on your explanation, I am still puzzled about a few points:
1> The quality at the end of the joined reads drops dramatically, while it looks pretty good in the imported reverse-read plot. How should I understand this? Is it also changed artificially by the merging?
2> Paired reads are identified by the identifier in the sequence header, so the joined reads should contain as many R2 reads as R1 reads. Why, then, does the number of sampled reads decrease as the length grows in the last part of the joined reads?
Thanks,
Qingqi

SoilRotifer · August 12, 2020, 9:32pm

I am not sure what you mean by this. The data should be processed for each sample separately, even if they were re-run, as you may have "per-run" sequencing biases.

This is likely due to the fact that we are observing a different population of reads. That is, only a fraction of the reads (1,314,874 of 3,291,173, roughly 40%) were merged.

This is the idea behind merging the reads. That is, you align the matching overlapping regions of R1 and R2 in order to generate a longer sequence. Given that the quality of your reads is poor, there will be more mismatches in this region of overlap, causing the merging process to fail for those pairs. Hence, fewer sequences in your output. If you search the forum and elsewhere you'll find additional content related to merging paired-end reads. You can read about how vsearch merges paired-end reads here.
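To make the failure mode concrete, here is a toy Python sketch of the idea: reverse-complement R2, compare it with the tail of R1 over the overlap, and reject the pair if there are too many mismatches. This is only an illustration. The function names, the fixed known overlap length, and the `max_diffs` threshold are assumptions for the example, and vsearch's real algorithm searches for the best overlap and uses quality-aware scoring.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def try_merge(fwd, rev, overlap, max_diffs):
    """Merge a read pair with a known overlap length (toy version).

    Returns the merged sequence, or None if the overlap contains
    more than max_diffs mismatches (hypothetical threshold).
    """
    rc = reverse_complement(rev)
    diffs = sum(a != b for a, b in zip(fwd[-overlap:], rc[:overlap]))
    if diffs > max_diffs:
        return None  # pair is discarded, so it never reaches the output
    return fwd + rc[overlap:]

# A clean pair drawn from a 14 bp fragment with a 4 bp overlap.
fragment = "ACGTACGTAAGGCC"
fwd = fragment[:10]
rev = reverse_complement(fragment[-8:])
print(try_merge(fwd, rev, overlap=4, max_diffs=1))

# A noisy reverse read with two wrong bases inside the overlap.
noisy_rev = reverse_complement("CAAAGGCC")
print(try_merge(fwd, noisy_rev, overlap=4, max_diffs=1))
```

Pairs that fail this kind of check are dropped rather than joined, which is why the joined output contains fewer sequences than the input.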

-Best wishes!