Differences in paired-end and single-end import of FASTQ files

Hello,

I wanted to ask a question regarding differences I am seeing between my single end and paired end import in Qiime2. I realize that it could be multiple variables involved, so I will list out everything I can. Please let me know if you need more details to understand.

I imported fastq files forward read via single-end import (SingleEndFastqManifestPhred33V2) and combination of forward and reverse reads via paired-end import (PairedEndFastqManifestPhred33V2).

The single-end import went through the pipeline as described on the Moving picture tutorial. I used Deblur to denoise and trimmed at 246 bp. I did the maximum amount possible since the bottom of the box didn't fall below quality score of 20 till the very end.

For the paired-end import, I joined the ends using vsearch join-pairs. Then did Deblur to denoise and trimmed at 435 bp. It looked like the reverse reads had worse quality, but the demultiplexed sequence summary said that most of the reads are going to be at least 439 nts (after joining) so I decided to trim there instead of earlier.

After going through the pipeline, the result is quite different. The lowest feature number that I saw for single-end was around 4700, while for paired-end it was 154. Because of the trim length, the taxa-bar is also quite different. bacteria that are identified at level 5 for single-end looks like this:


While the paired-end looks like this:

I'm also struggling to do alpha and beta diversity analysis on the paired-end imports as I have to eliminate samples based on the sampling depth.

Are the differences I am seeing entirely based on the trim length? I'm unable to tell which import is more "accurate" in terms of the results that I am seeing. Are the paired-end imports worse because I didn't trim them early enough (reverse reads looks quite bad)? Or are the paired-end imports more accurate taxonomy wise because there are more of the sequences available? If I should be trimming the paired-end imports earlier (around 250) what is the point of doing paired-end instead of a single-end import?

Please let me know!

Hello Shingo,

Thank you for describing your pipeline in detail. I like your choice of steps and think the settings you have chosen are resonable.

Based on what you have told me, I think the issue is with joining the paired end reads. The quality plot after joining shows reads as long as 450 bp, which would only make sense if your amplicon was also 450 bp long. What region did you sequence and how long do you expect it to be?

(My hunch is that vsearch is not overlapping your reads enough during joining, leading to strange results. Like this

True amplicon |--------------------|
Correct R1    |--------------->
Correct R2        <----------------|

Wrong R1 |--------------->
Wrong R2               <----------------|
Wrong    |------------------------------| result after joining

Let's investigate more! :mag_right:

Hi Colin,

Thank you for your response!

I've had the V3-V4 region of the 16s rRNA sequenced. The primers used should be 341F and 805R, so I think you are absolutely correct to doubt the joining process. I think the sequenced amplicon (~428 bp) should be shorter than what my quality plot says. How should I go about fixing this?

Best,
Shingo

1 Like

OK, I'm glad I asked! At first glance, 805-341 = 464, which is pretty close to what we are seeing in that joined plot, but you know more about your amplicons than I do. If 428 is expected, then we should aim for that.

Check the docs for vsearch join-pairs for all the settings we can change. The setting --p-minovlen controls the minimum allowable overlap and is 10 by default. For you amplicon:
R1 + R1 - amplicon = overlap
250 + 250 - 428 = overlap
72 = overlap

Try using --p-minovlen 50 or --p-minovlen 60 and see if we can get vsearch to join reads to the expected length.