Concatenate R1 and R2 for reads that can't join

I've got some amplicon reads that do not, and should not, overlap.

These amplicons are continuous, but the region sequenced is not.
(It's not like the multi-region Ion Torrent kit.)

Some programs support aligning non-overlapping paired-end reads without joining them.

Vsearch supports --fastq_join, in which "sequences are not merged as with the fastq_mergepairs command, but simply joined with a gap."

|------------------------| Full amplicon
|---------->               R1 
              <----------| R2
|----------nnnn----------| Concatenated (no overlap)
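To make the diagram concrete, here is a minimal Python sketch of that kind of join: R1, then a padding run of Ns, then the reverse complement of R2. This is not vsearch's code. The function names are hypothetical, and the eight-character default padding is my assumption about what the tool uses, so check `vsearch --help` before relying on it.

```python
# Sketch of joining a non-overlapping read pair with an N gap.
# Names and default padding are illustrative assumptions, not a tool's API.

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


def join_pair(r1_seq, r1_qual, r2_seq, r2_qual,
              pad="NNNNNNNN", pad_qual="IIIIIIII"):
    """Concatenate R1 + padding + reverse-complemented R2.

    Quality strings are carried along; R2's quality string is reversed so
    each score stays with its base after reverse complementing.
    """
    if len(r1_seq) != len(r1_qual) or len(r2_seq) != len(r2_qual):
        raise ValueError("sequence and quality lengths must match")
    if len(pad) != len(pad_qual):
        raise ValueError("padding sequence and quality lengths must match")
    joined_seq = r1_seq + pad + reverse_complement(r2_seq)
    joined_qual = r1_qual + pad_qual + r2_qual[::-1]
    return joined_seq, joined_qual


if __name__ == "__main__":
    seq, qual = join_pair("ACGT", "IIII", "AAGG", "ABCD",
                          pad="NN", pad_qual="!!")
    print(seq)
    print(qual)
```

Note that the padding length is fixed and says nothing about the true size of the unsequenced region, which is part of why downstream tools may not handle the joined read well.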

Does Qiime2 support discontiguous reads these days? Looks like it didn't in 2019:

What parts of the pipeline could be configured to support a read with NNNs in the middle?

Is this a use case for q2-sidle, or is that for separate amplicons instead of amplicons with gaps in the middle?

Hi @colinbrislawn,

I'm trying to work through how this works. So, it's something where you're missing the end of the reads and cannot overlap them, but theoretically, if, say, your sequencing was long enough, you could? Or like, you have amplicons from the 16S and ITS genes that you're trying to combine?

I think the first case could work with Sidle - the current tutorial actually uses forward and reverse reads from the same primer pair as one of the regions amplified. Whether or not they overlap is irrelevant to the processing.

There are a few caveats though, in my mind:

  1. The quality of the alignment goes down based on the number of sequences in your database that are the same over a region. You might choose to dereplicate your database if you're working in a smaller subset or region.
  2. Sidle's alignment parameters tend to be pretty stringent, and that can cause some weirdness and you may lose reads. I'm still working on an "accounting" function that would tell you how many you've lost, and I have so many things I'm trying to do right now.
  3. You will likely want to build your table in "average" mode (default) where the read "depth" is provided based on the average sequencing depth over all regions, which is probably more appropriate if you're merging forward and reverse reads.

If you try it, please let us know how it works.

Best,
Justine

Correct. Like a 16S V4 amplicon with 100 bp paired-end reads (50 bp gap).
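The arithmetic behind that example can be sketched in a couple of lines of Python. The 250 bp amplicon length is not stated above; it is an assumption back-calculated from two 100 bp reads plus the 50 bp gap, and the function name is made up for illustration.

```python
def paired_end_gap(amplicon_len: int, read_len: int) -> int:
    """Bases left unsequenced between R1 and R2.

    A positive value is a gap; a negative value means the reads overlap
    by that many bases and could be merged normally.
    """
    return amplicon_len - 2 * read_len


# Assumed 250 bp amplicon (2 x 100 bp reads + 50 bp gap).
print(paired_end_gap(amplicon_len=250, read_len=100))
```

With longer reads on the same amplicon, say 150 bp, the result goes negative, which is the ordinary overlapping case where merging works.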

Thank you for telling me more about Sidle. I'll look into that!

As far as I know, there still isn't anything in the core plugins. Outside of Sidle, exposing vsearch's option in the Q2 plugin, or even DADA2's justConcatenate in mergePairs, would be the easiest Q2 adaptation for a future release, I would think.

Correct

This would be easy and probably less likely to cause issues downstream than some other options (it would break dada2 and break classify-sklearn but would work for OTU clustering and alignment). One hang-up is that semantic types would need to be specified accordingly, to prevent passing this output to dada2, etc.

This has been discussed a bit on the forum and off. It could lead to serious issues downstream with taxonomy classification, alignment/phylogeny building, and maybe elsewhere. So exposing it would be easy, but this is not advisable (unless a new semantic type is created for such an output, which would not be compatible with many downstream steps).

This is something that has been on my radar to look into for quite some time, so maybe later this year I can explore if others do not beat me to it :grin:

Sounds like a promising secondary use case for q2-sidle!

100% @Nicholas_Bokulich !

This is how I saw it in my head as well
