Looking for help joining paired-end reads

Hi Mike,

I tried to trim off the primers with cutadapt. Here's my cmd:

qiime cutadapt trim-paired \
 --i-demultiplexed-sequences csv18-pair-end.qza \
 --p-front-f GTGYCAGCMGCCGCGGTAA \
 --p-front-r CCGYCAATTYMTTTRAGTTT \
 --p-error-rate 0 \
 --o-trimmed-sequences csv18-pair-trim.qza \
 --verbose

I got something slightly different but the length didn't change.
Before: csv18-pair-end.qzv (313.8 KB)
After: csv18-pair-trim.qzv (319.1 KB)

You'll want to add

 --p-match-adapter-wildcards \
 --p-match-read-wildcards \
 --p-discard-untrimmed \

to your command. The first two allow matches to IUPAC ambiguity codes (e.g. N, M, R...), while the last discards any pairs in which both primers are not found. This is why there are two drop-offs at the end of the quality plots: some reads are not being trimmed.
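
Putting the original command and the three extra flags together would look roughly like the sketch below (note the plural "wildcards" in both flags). The output name csv18-pair-trim2.qza is the file name used later in this thread; the wrapper name trim_primers is only a hypothetical label for illustration, and the command is run only if qiime is found on the PATH.

```shell
# Hypothetical wrapper name; the qiime flags are the ones discussed above.
trim_primers() {
  qiime cutadapt trim-paired \
    --i-demultiplexed-sequences csv18-pair-end.qza \
    --p-front-f GTGYCAGCMGCCGCGGTAA \
    --p-front-r CCGYCAATTYMTTTRAGTTT \
    --p-error-rate 0 \
    --p-match-adapter-wildcards \
    --p-match-read-wildcards \
    --p-discard-untrimmed \
    --o-trimmed-sequences csv18-pair-trim2.qza \
    --verbose
}

# Only run the command when QIIME 2 is available in this shell.
if command -v qiime >/dev/null 2>&1; then
  trim_primers
else
  echo "qiime not found: activate your QIIME 2 environment first"
fi
```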

-Mike

Nice! Now it works. There are some scattered points in the reverse reads: csv18-pair-trim2.qzv (317.4 KB)

That should be fine @Hui_Yang. These are just a few spurious long, and likely off-target, amplicons in your data, which is normal. If you mouse over them you’ll see they are low-count.

-Mike

Awesome, thanks.

I just ran another round of DADA2 and got 259 features. mrdna-tm2-tab.qzv (421.3 KB)

Also tried to join the trimmed ends for Deblur, using vsearch:
qiime vsearch join-pairs \
 --i-demultiplexed-seqs csv18-pair-trim2.qza \
 --o-joined-sequences trim2-joined.qza

The result looks like the two reads were stitched together end to end, with no sign of overlap. trim2-joined.qzv (304.9 KB)

Looking at the provenance, you forgot to adjust the truncation length of my initial suggestion of fw:268 and rev:211 by subtracting the length of the primer. So, your new truncation lengths, after running cutadapt, should be something like:

fw : ~ 250
rev : ~ 190
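
A minimal sketch of that subtraction, using the primer sequences from the cutadapt command earlier in the thread and the originally suggested truncation lengths. The variable names are made up for illustration, and the result is only a starting point for the DADA2 truncation flags.

```shell
# Primer sequences from the cutadapt command earlier in this thread.
fwd_primer=GTGYCAGCMGCCGCGGTAA
rev_primer=CCGYCAATTYMTTTRAGTTT

# Truncation lengths originally suggested for the untrimmed reads.
orig_trunc_f=268
orig_trunc_r=211

# ${#var} expands to the number of characters in the variable,
# i.e. the primer length in bases.
new_trunc_f=$(( orig_trunc_f - ${#fwd_primer} ))
new_trunc_r=$(( orig_trunc_r - ${#rev_primer} ))

echo "--p-trunc-len-f $new_trunc_f --p-trunc-len-r $new_trunc_r"
```

The primers are 19 and 20 bases long, so this prints 249 and 191, in line with the rounded ~250 and ~190 above.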

My apologies, I was still fixed on the idea that the trunc length is applied after trimming.

Here’s another round following your suggestion, with the primer lengths subtracted from both sets of truncation lengths (FW: 268 | 283 and REV: 211 | 243):
--p-trunc-len-f 249 --p-trunc-len-r 191 --> 257 features
--p-trunc-len-f 264 --p-trunc-len-r 223 --> 265 features

Best

The trim length setting only applies when running DADA2 / deblur. Remember that you ran cutadapt to remove the primers as a separate, prior step, so the sequences are already shorter before DADA2 / deblur runs. :)

Does that make sense?

-Mike

Yes thanks for clarifying!

Hi @Hui_Yang, @Mehrbod_Estaki just reminded me, the trunc params are applied after the trim params - I have updated my post above to reflect this.

Gotcha. Thanks for clarifying:) I think that just brought me back to my initial questions:

Why didn't my reads merge despite the overlap? When I use vsearch, it seems to simply stitch the two reads together instead of merging them. I want to make sure nothing is wrong with my sequences.
Because of that, I now wonder whether DADA2 truly joined my reads or did the same thing, and whether that is why I lost about 40% of my features when denoising paired-end reads.


If it helps, here's a brief recap of what I did so far:

  • Imported single-end reads - mrdna-r1-forward.qzv (290.4 KB) and paired-end reads - csv18-pair-end.qzv (313.8 KB), then used cutadapt to trim off the primers - csv18-pair-trim2.qzv (317.4 KB)

  • Used vsearch to merge the paired-end reads - csv18-joined.qzv (300.6 KB), and the trimmed reads too - trim2-joined.qzv (304.9 KB).

  • Attempted to denoise with DADA2; params and feature counts:
    Single end:
    Trim: f = r = 20, Trunc len: f = r = 240 -- 396 features
    Paired end:
    Trim: f = r = 10, Trunc len: f = r = 220 -- 294 features
    Trim: f = r = 10, Trunc len: f = r = 240 -- 254 features
    Trim: f = r = 20, Trunc len: f = r = 180 -- 48 features
    Trim: f = r = 20, Trunc len: f = r = 260 -- 255 features
    Trim: f = r = 20, Trunc len: f = r = 290 -- 207 features
    Trim: f = r = 20, Trunc len: f = 220 r = 200 -- 42 features
    Trim: f = r = 20, Trunc len: f = 240 r = 220 -- 274 features
    Trim: f = r = 30, Trunc len: f = r = 150 -- 44 features
    Trim: f = r = 0, Trunc len: f = r = 280 -- 205 features
    Trim: f = r = 0, Trunc len: f = r = 295 -- 212 features
    Trimmed paired end (trim = 0):
    Trunc len: f = 249 r = 191 -- 257 features
    Trunc len: f = 264 r = 223 -- 265 features

Let me know if you want me to upload any table or status files.

Thanks

@Hui_Yang, can you provide explicit sequence examples where you think the reads are simply being stitched together end-to-end? This should not be the case: by default, vsearch requires a minimum overlap of 10 bases for a successful merge, unless you are changing this value. DADA2 requires a 12-base overlap, which currently cannot be altered via the dada2 plugin at the command-line interface.
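
One quick way to see whether a given pair of truncation lengths can meet that minimum is to compare their sum with the amplicon length. A minimal sketch, assuming an amplicon of roughly 400 nt (an assumed round number for this primer pair, not a measured value, and counted on the same footing as the reads, i.e. including primers if the reads still carry them); overlap is a hypothetical helper name:

```shell
# overlap TRUNC_F TRUNC_R AMPLICON_LEN
# Prints how many bases the two truncated reads share. A negative value
# means the reads leave a gap between them and cannot be merged.
overlap() {
  echo $(( $1 + $2 - $3 ))
}

amplicon_len=400   # assumption: approximate target length, adjust to your data

short=$(overlap 180 180 "$amplicon_len")
long=$(overlap 240 240 "$amplicon_len")

echo "trunc 180/180: overlap $short"
echo "trunc 240/240: overlap $long"
```

Under that assumption, 180/180 gives -40 (no overlap at all) and 240/240 gives 80 (well above the 12-base minimum). The 180/180 run in the recap above produced only 48 features, which would be consistent with this. Real amplicons vary in length, so treat these numbers as a rough guide only.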

What would be most helpful is the output from the --o-denoising-stats of DADA2 or --o-stats from deblur. These outputs will show you where most of your data is being lost, that is, at merging, denoising, etc.

-Mike

I did not look into the sequences. I had the impression that they were stitched because when I joined two 300 nt reads csv18-pair-end.qzv (313.8 KB) I got a 600 nt read, and when I joined two 280 nt reads csv18-pair-trim2.qzv (317.4 KB) I got a 550 nt sequence. I thought they were supposed to merge into a ~400 nt piece (515f - 926r).

As for the stats, I lost a lot of data during merging and chimera checking. Here are some examples: mrdna-rj-stat.qzv (1.2 MB) mrdna-rll-stat.qzv (1.2 MB).

Best

@Hui_Yang,

Always try to confirm what is happening before making assumptions. Unverified assumptions make it harder for us to help you.

The long sequences in question are not the result of reads being stitched together. As I mentioned previously, these are likely spurious off-target sequences that happen to be a different length from your primary amplicon target. This is commonly encountered in sequencing runs. Remember those slightly longer reverse reads we discussed?

I would not be surprised if some of the merged sequences are from these. You can also obtain shorter merged sequences. Anyway, these spurious reads can be removed through a variety of downstream processing steps within QIIME 2.
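
As a rough arithmetic check, the lengths reported earlier in the thread are about what ordinary overlap merging of a few long fragments would produce. The sketch below assumes the vsearch default minimum overlap of 10 bases and the approximate read lengths quoted in the thread (300 nt untrimmed, 280 nt after primer removal); longest_merge is a hypothetical helper name.

```shell
# longest_merge READ_F READ_R MIN_OVERLAP
# Prints the longest merged read that two reads of the given lengths
# can form when they overlap by only the minimum number of bases.
longest_merge() {
  echo $(( $1 + $2 - $3 ))
}

min_overlap=10   # vsearch default mentioned earlier in the thread

untrimmed_max=$(longest_merge 300 300 "$min_overlap")
trimmed_max=$(longest_merge 280 280 "$min_overlap")

echo "untrimmed reads: up to $untrimmed_max nt"
echo "trimmed reads:   up to $trimmed_max nt"
```

That gives 590 and 550, close to the ~600 nt and 550 nt figures reported, so the long tail is consistent with a few long fragments that overlap only minimally. This does not prove that is what happened; looking at the actual sequences is the only way to be sure.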

It does look like you are losing quite a few reads. You may have to play around with more truncation parameters. You can also simply ignore the reverse reads and process the data with just the forward reads. I've had many data sets in which the reverse reads were so poor that I could not use them.

Thank you so much! I’ll give it a few more rounds of testing, and if it doesn’t work, I will proceed with forward reads.

For the purpose of learning and future reference, how do I check if my reads were joined properly? What downstream processing steps did you mean?

Best

That is a harder question to answer; there are many things you can do to quality check your data.

In short, work through as many of the tutorials as you can. You'll eventually get a feel for what to consider.

All very true. Thanks!!
