Looking for help joining paired-end reads

Hui_Yang · November 9, 2020, 9:16pm

Hi Justine,

I have a question that seems to be relevant to this stream. I imported my raw data (16S V4V5, 515f/806r) using Casava 1.8 paired-end demultiplexed fastq format and they seem fine. Then I try to join them using vsearch command and I don't think they are joined. Here are the qzv files.
csv18-joined.qzv (300.6 KB) csv18-pair-end.qzv (313.8 KB)

I used vsearch with default parameters:
qiime vsearch join-pairs --i-demultiplexed-seqs csv18-pair-end.qza --o-joined-sequences csv18-joined.qza

I also imported R_1 reads as single-end seqs and ran DADA2, which is what I usually do, on both single-end and paired-end seqs, and I got about half features from paired-end seqs than from single-end seqs.

I don't know what is going on. Is this a problem with my sequencing data or some parameters I did not set right?

Thank you

thermokarst · November 10, 2020, 2:26pm

Hi @Hui_Yang!

You should not join your paired-end reads prior to running DADA2 - DADA2 expects the reads to be unjoined, and performs the joining automatically when executing.

Hope that helps! :qiime2:

Hui_Yang · November 10, 2020, 3:38pm

Thanks Matt!

I did not join them when I ran DADA2. When I did DADA2, I imported paired end reads and forward reads separately and went through the pipeline respectively. I got very different feature counts -- about 220 from pair end and 450 from single end (forward).
I didn't know what was going on so I was trying to join them and run Deblur yesterday.

Best
Hui

thermokarst · November 11, 2020, 5:10pm

Let's take a closer look at your DADA2 denoising stats - I suspect your trim and truncation params when running denoise-paired are removing many features. Please share the denoising stats QZVs for both denoise-single and denoise-paired.

Hui_Yang · November 11, 2020, 8:18pm

Gladly. Here are the table and status qzvs:
Single end: mrdna-r1-stat.qzv (1.2 MB) mrdna-r1-tab.qzv (423.3 KB)
Paired: mrdna-rj-stat.qzv (1.2 MB) mrdna-rj-tab.qzv (417.9 KB)

Seems a lot of inputs did not merge.

Here are the imported files just in case: mrdna-r1-forward.qzv (290.4 KB) csv18-pair-end.qzv (313.8 KB)

Let me know anything else I can provide.

Thanks

thermokarst · November 12, 2020, 2:34pm

Thanks for sharing these, @Hui_Yang!

I agree. It isn't the worst I have seen, but there is a lot of reads being discarded during the merge step. As well, it looks like a few of your samples are losing many reads at the chimera filtering step.

I'm pretty sure 515f/806r is V4 only (can @jwdebelius or @SoilRotifer confirm that?), not V4-V5. Can you confirm what your target region is, and what primers you used? These are important to keep in mind when choosing your DADA2 trim and trunc parameters, and they have a direct relationship to the merging step - if you pick too aggressive trim and trunc params, you run the risk of making the reads unmergeable. Also, q2-dada2 needs at least 12 nts of overlap in order to merge. @Mehrbod_Estaki has a few good posts floating around the forum demonstrating how to calculate the overlap, check the "Search" functionality here in the forum for more details.

:qiime2:

SoilRotifer · November 12, 2020, 2:57pm

Hi @Hui_Yang & @thermokarst,

Correct, 515-806 is V4 only. V4V5 is usually something near the 515F-958R range.

-Mike

Hui_Yang · November 12, 2020, 3:24pm

Thank you! Sorry I put the wrong primers, it was 515F and 926R.

I got demultiblexed raw data but I couldn't remove the primers (somehow the tool they provided didn't work) so I trimmed off 20 nts on the left side of R1 and R2. I set trunc len = 220, which should give me about 30nts of overlap, but that may be too short. I'll try longer trunc length.

Best

thermokarst · November 12, 2020, 4:01pm

Thanks for clarifying, @Hui_Yang!

515F 926R is ~411 nts long. Based on the provenance in the files you shared, your trim and trunc params were:

trunc_len_f: 240
trim_left_f: 20

That means that the forward reads were trimmed to 220 nts

trunc_len_r: 200
trim_left_r: 20

That means the reverse reads were trimmed to 180 nts

~~I think that means there is no way the fwd and rev reads can overlap:~~

411 nt amplicon |----------------------------------------------|
240 nt fwd read |-------------------------->
200 nt rev read                    <---------------------------|
29 nt overlap                      ^^^^^^^^^

@SoilRotifer, am I doing this math right?

~~I think you will need to relax your params quite a bit, you need at least 23 nts more to ensure sufficient overlap.~~

SoilRotifer · November 12, 2020, 5:07pm

thermokarst:

I think that means there is no way the fwd and rev reads can overlap:
411 nt amplicon |----------------------------------------------|
220 nt fwd read |-------------------->
180 nt rev read                          <---------------------|
11 nt gap                             ^^^
@SoilRotifer, am I doing this math right?

Yeah that works out!

I think your forward reads look great @Hui_Yang! Based on the demux paired-end output, you should be able to go out to further for each read.

Based on the co-ordinates of the demux qzv file, I'd try the following truncation point values (or combinations thereof) :

FW: 268 | 283
REV: 211 | 243

I'd try the 268 - 211 pair first.

RE

Why not try running q2-cutadapt on your demuxed data prior to dada2 / deblur? You can search the forum for many examples of running cutadapt. Then re-visualize your cutadapt output (quality plots) the same way you did your demuxed data. This will help you determine the appropriate truncation points after the primers have been removed, i.e. they'll likely be 20-30 bp shorter. At this point there is likely no need to use the trim options.

-Cheers!
-Mike

Hui_Yang · November 12, 2020, 6:20pm

Oh I thought trunc len was the length after the trimming. Good to know!

I tested a few params and here's what I got:

Trim: f = r = 10, Trunk len: f = r = 220, 294 features
Trim: f = r = 10, Trunk len: f = r = 240, 254 features
Trim: f = r = 20, Trunk len: f = r = 260, 255 features
Trim: f = r = 20, Trunk len: f = r = 290, 207 features
Trim: f = r = 0, Trunk len: f = r = 280, 205 features
Trim: f = r = 20, Trunk len: f = 240 r = 220, 272 features

Seems trimming didn't make too much difference, and trunc length shouldn't be too long or too short. Is it ok to proceed with the setting that gives the most features?

Hui_Yang · November 12, 2020, 6:21pm

I'll try that. Thanks!

SoilRotifer · November 12, 2020, 6:44pm

Well... this was specifically in reference to my comment about using cutadapt. That is, if you run cutadapt to remove the primers, your sequences will be shorter. So you'd have to re-visualize them.

Trimming the 5' ends will not have an effect on the merging process, specifically. It is the quality and overlap of the 3' ends that matter.

As long as the quality is good, and there is enough over-lap to merge the reads, being "too long" or "too short" won't matter.

Did you try the specific truncation values I suggested? Especially the fw: 268 with rev: 211? There can be large differences just shifting a few bases over.

-Mike

Hui_Yang · November 12, 2020, 8:00pm

Hi Mike,

I tried to trim off the primers with cutadapt. Here's my cmd:

qiime cutadapt trim-paired
--i-demultiplexed-sequences csv18-pair-end.qza
--p-front-f GTGYCAGCMGCCGCGGTAA
--p-front-r CCGYCAATTYMTTTRAGTTT
--p-error-rate 0
--o-trimmed-sequences csv18-pair-trim.qza
--verbose

I got something slightly different but the length didn't change.
Before: csv18-pair-end.qzv (313.8 KB)
After: csv18-pair-trim.qzv (319.1 KB)

SoilRotifer · November 12, 2020, 8:07pm

You'll want to add

 --p-match-adapter-wildcards \
 --p-match-read-wildcard \
 --p-discard-untrimmed \

to your command. The first two allow matches to IUPAC ambiguity codes (e.g. N, M, R...) while the last discards any pairs in which both primers are not found. This is why there are two drop-offs at the end of the quality plots, some are not being trimmed.

-Mike

Hui_Yang · November 12, 2020, 8:16pm

Nice! Now it works. There are some scattered points in the reverse reading csv18-pair-trim2.qzv (317.4 KB)

SoilRotifer · November 12, 2020, 8:27pm

That should be fine @Hui_Yang. These are just a few spurious long, and likely off-target, amplicons in your data. Which is normal. If you over them you'll see they are low-count.

-Mike

Hui_Yang · November 12, 2020, 9:02pm

Awesome, thanks.

I just ran another round of DADA2, got 259 features.mrdna-tm2-tab.qzv (421.3 KB)

Also tried to join the trimmed ends for Deblur, using vsearch:
qiime vsearch join-pairs
--i-demultiplexed-seqs csv18-pair-trim2.qza
--o-joined-sequences trim2-joined.qza

Got a read of two ends stitched directly, no signs of overlap. trim2-joined.qzv (304.9 KB)