Sidle for multiple partly overlapping regions?

Hi, I have successfully used Sidle on a dataset created with the 5R primers (Nejman et al., Science 2020). These primers amplify 5 non-overlapping genomic regions (each covering a 16S variable region).

I recently got a dataset created with the xGen 16S v2 kit, which uses 23 primers to amplify 6 partly overlapping long regions (together covering all 9 16S variable regions). Can I use Sidle for such data, where the amplified regions can be highly overlapping (e.g. some amplicons cover 16S V6-V8 while others cover just V7-V8)?

best,
Kristian

Hi @Kristian_Holm,

I don't think it should be an issue. The tutorial data uses forward and reverse reads from the same relatively short region (I think there's maybe a 50-60 nt overlap).

You will still need to extract and align each region on its own. The reconstruction algorithm that's used to make the database map and "untangle" the reads is mostly uninterested in the actual relative primer positions: it just wants to know whether two reads share the same space over all X regions where they're found. If there's any region where a read is unique, or if a read switches partners, then the read is assumed to be unique. But, again, that doesn't say anything about region overlap.
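Concretely, the per-region extraction from the reference looks something like this (just a sketch; the 341F/805R V3-V4 pair is a stand-in primer pair and the file names are hypothetical). You repeat it once per region with that region's primers:

# Extract the reference reads amplified by one region's primer pair
qiime feature-classifier extract-reads \
  --i-sequences silva-ref-seqs.qza \
  --p-f-primer CCTACGGGNGGCWGCAG \
  --p-r-primer GACTACHVGGGTATCTAATCC \
  --o-reads region1-extracted-ref-seqs.qza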

If it doesn't work, please let us know so we can fix the issue!

Best,
Justine

Hi Justine, thank you for answering!

A potential problem with extracting the regions from the database sequences is that this kit uses up to 6 slightly different primer combinations for each region. So if I do 6 different 'qiime feature-classifier extract-reads' runs per region and merge the results into a regionX.fasta, I will possibly have several extracted sequences with the same sequence ID in that file. I guess one database sequence should only contribute once to each region file?

best,
kristian

Hi @Kristian_Holm,

That's not quite how Sidle works, though.

There's an overview figure in the preprint, FWIW, since it's a big process.

[Figure 1 of the Sidle manuscript, showing a schematic of the process. Panel A: database preparation, removing degenerates and then extracting reads for each region. Panel B: for each region, the demultiplexed sequences are denoised and aligned to the regional database to produce a regional alignment. Panel C: regions are combined using the regional database, regional alignments, and regional ASV tables to produce a reconstructed table and a reconstruction map; the map and database taxonomy are combined to produce the reconstructed taxonomy. Panel D: a tree is built using the reconstruction map and an aligned version of the database to reconstruct fragments, which are then inserted into a scaffold reference tree.]

(From Debelius et al., bioRxiv.)

Essentially, the algorithm does the alignment on a per-region basis. The extracted regional sequences should never be merged into a single file. You extract your regional reads using extract-reads, and then the prepare-extracted-region function gives you your regional reference sequences (the collapsed kmers) and the regional kmer map. These get kept separate on a per-region basis throughout processing.
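Something like this, per region (a sketch only; file and region names are made up, the primers are the same stand-in pair as above, and exact parameter names may vary a little with your q2-sidle version):

# Collapse the extracted reference reads into fixed-length kmers and build the regional kmer map
qiime sidle prepare-extracted-region \
  --i-sequences region1-extracted-ref-seqs.qza \
  --p-region "region1" \
  --p-fwd-primer CCTACGGGNGGCWGCAG \
  --p-rev-primer GACTACHVGGGTATCTAATCC \
  --p-trim-length 250 \
  --o-collapsed-kmers region1-kmers.qza \
  --o-kmer-map region1-kmer-map.qza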

The algorithm assumes you've split your sequencing data into regional batches before denoising. I tend to use cutadapt because I try to keep my primers in place, but there are a bunch of different ways you could do this.
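One way to do that split, as a rough sketch (run once per region with that region's primer pair, so --p-discard-untrimmed keeps only the reads that matched; primers and file names are placeholders):

# Pull out the reads belonging to one region and trim that region's primers
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f CCTACGGGNGGCWGCAG \
  --p-front-r GACTACHVGGGTATCTAATCC \
  --p-discard-untrimmed \
  --o-trimmed-sequences region1-demux.qza

Each regional batch then gets denoised on its own (e.g. with DADA2) to give a per-region feature table and representative sequences; for Sidle, the regional ASVs also need to be trimmed to a consistent length that matches the kmer trim length for that region.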

The reference sequences then get aligned to your per-region ASVs. This should be a one-to-one region mapping. The regional reference sequences are never merged into a common file, nor are the regional ASV sequences. Or at least, they're not really intended to be.
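In command form, that alignment is roughly (same hypothetical names as above; --i-rep-seq is that region's ASV representative sequences from denoising, and the parameter names may differ slightly between q2-sidle versions):

# Align the regional ASVs against the regional reference kmers
qiime sidle align-regional-kmers \
  --i-kmers region1-kmers.qza \
  --i-rep-seq region1-rep-seqs.qza \
  --p-region "region1" \
  --o-regional-alignment region1-alignment.qza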

The point where stuff comes together is the database reconstruction step. That's where those read maps are brought together and the database gets "solved". There are no DNA sequence representations in this step, just a file that maps the regional ID to the database ID and the region.

The algorithm tries to figure out whether the sequence IDs are shared or unique across multiple regions.

As a simple example, if I have 6 sequences in my database and two regions, the regional and final mapping might look something like this:

Reference Sequence    Region 1     Region 2     Final Name
Seq1                  Seq1|Seq2    Seq1|Seq1    Seq1|Seq2
Seq2                  Seq1|Seq2    Seq1|Seq1    Seq1|Seq2
Seq3                  Seq3         --           Seq3
Seq4                  --           Seq4         Seq4
Seq5                  Seq5|Seq6    Seq5         Seq5
Seq6                  Seq5|Seq6    --           Seq6

The database and alignment information then gets put into the expectation-maximization algorithm that gives you a final table.
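For a two-region example like the one above, the reconstruction commands would look roughly like this (a sketch only; the file names are hypothetical, and the exact parameter names and how the work is split between reconstruct-database and reconstruct-counts depend on your q2-sidle version):

# Solve the database: build the map from regional IDs to final reference IDs
qiime sidle reconstruct-database \
  --p-region "region1" --p-region "region2" \
  --i-kmer-map region1-kmer-map.qza --i-kmer-map region2-kmer-map.qza \
  --i-regional-alignment region1-alignment.qza --i-regional-alignment region2-alignment.qza \
  --o-database-map database-map.qza \
  --o-database-summary database-summary.qza

# Use the solved database plus the regional ASV tables to get the final feature table
qiime sidle reconstruct-counts \
  --p-region "region1" --p-region "region2" \
  --i-kmer-map region1-kmer-map.qza --i-kmer-map region2-kmer-map.qza \
  --i-regional-alignment region1-alignment.qza --i-regional-alignment region2-alignment.qza \
  --i-regional-table region1-table.qza --i-regional-table region2-table.qza \
  --i-database-map database-map.qza \
  --i-database-summary database-summary.qza \
  --o-reconstructed-table reconstructed-table.qza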

So, again, at no point do the regional reads touch each other.

...Hopefully this makes sense?

Best,
Justine

Hi Justine,
thanks for the great overview! But I didn't explain myself clearly enough. I meant that the kit has several different primer combinations for each genomic target area (e.g. it uses 6 unique forward primers and 1 reverse primer to amplify V6-V8, and the same scheme holds for the other genomic target areas as well):

4.1  CAACGCGAAGAACCTTACC...TTGTACACACCGCCCGTC
4.2  ATACGCGAGGAACCTTACC...TTGTACACACCGCCCGTC
4.3  CTAACCGANGAACCTYACC...TTGTACACACCGCCCGTC
4.4  CAACGCGAAAAACCTTACC...TTGTACACACCGCCCGTC
4.5  CAACGCGCAGAACCTTACC...TTGTACACACCGCCCGTC
4.6  ATACGCGARGAACCTTACC...TTGTACACACCGCCCGTC

I could not use 'qiime feature-classifier extract-reads' for this, since I can only input one primer combination per region there. But I solved it by using "linked adapters" with cutadapt instead, placing all the combinations for one region in a FASTA file and then repeating for the other regions. That way, reference sequences extracted with any primer combination from the same region end up in the same file, but the regions stay separate:

for i in {1..6}; do
  # Extract region ${i} from the degenerate- and taxonomy-filtered SILVA reference,
  # using a file of linked adapters (one record per primer combination for that region)
  cutadapt -e 0.1 --no-indels --overlap 16 --discard-untrimmed -m 1 \
    -g file:primers_linked_region${i}.fa \
    -o silva-138.1-ssu-nr99-seqs-derep-uniq_degeneratefiltered_taxfiltered_region${i}.fna \
    silva-138.1-ssu-nr99-seqs-derep-uniq_degeneratefiltered_taxfiltered.fasta.gz
done
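For reference, the linked-adapter file for one region (region 4 here) just lists each primer combination as one linked-adapter FASTA record; the record names are arbitrary:

>4.1
CAACGCGAAGAACCTTACC...TTGTACACACCGCCCGTC
>4.2
ATACGCGAGGAACCTTACC...TTGTACACACCGCCCGTC
>4.3
CTAACCGANGAACCTYACC...TTGTACACACCGCCCGTC
>4.4
CAACGCGAAAAACCTTACC...TTGTACACACCGCCCGTC
>4.5
CAACGCGCAGAACCTTACC...TTGTACACACCGCCCGTC
>4.6
ATACGCGARGAACCTTACC...TTGTACACACCGCCCGTC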

best,
Kristian

Hi @Kristian_Holm,

In that case, you may have an issue. A single primer sequence for each extracted region is a prerequisite for the algorithm. The mixed reference sequences will likely fail, either in the prepare-extracted-region step or when you're solving the database.

You could try to collapse your primers into a single sequence with degenerates if they target the same position. If they're staggered, maybe you could choose the most extreme primer. But you need a single regional representative to use for extraction and as a lure.

Best,
Justine

Hi Justine,
I just chose one of the primer-sequence pairs for the parameters in prepare-extracted-region. Since the primer combinations target the same position, and they are no longer part of the extracted sequences nor the trimmed reads, I thought they were just used as "names"? When using primers with degenerates, the algorithm doesn't know which non-collapsed primer sequence belongs to each reference sequence either, does it?

best,
kristian

Hi @Kristian_Holm,

The degenerate primers get extracted once with extract-reads. I'm not sure if cutadapt allows for mismatches (extract-reads allows 2 by default, which is less than happens biologically, but good enough for a first-order approximation). My concern is that your concatenated file may have the same sequence appear multiple times without degeneracy notation, which would throw off reconstruction.

The second concern is that the primer does appear one more time: if you're planning on doing fragment reconstruction for a tree, the primer gets used to build the fragments.
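If it helps, that fragment/tree step is roughly the following (a sketch; the aligned SILVA reference and the SEPP reference artifact are assumptions about your setup, and parameter names may vary between versions):

# Rebuild consensus fragments for the reconstructed IDs from an aligned reference
qiime sidle reconstruct-fragment-rep-seqs \
  --i-reconstruction-map database-map.qza \
  --i-reconstruction-summary database-summary.qza \
  --i-aligned-sequences silva-aligned-ref-seqs.qza \
  --o-representative-fragments reconstructed-fragments.qza

# Insert the fragments into a reference backbone tree
qiime fragment-insertion sepp \
  --i-representative-sequences reconstructed-fragments.qza \
  --i-reference-database sepp-refs-silva-128.qza \
  --o-tree insertion-tree.qza \
  --o-placements insertion-placements.qza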

Best,
Justine

Hi Justine,

Yes, cutadapt allows mismatches. And when using several primer pairs targeting the same region as "linked adapters", cutadapt will only use the best-matching primer pair to extract a given sequence, so each reference sequence will only exist once per region; there's no need for concatenation.

So then hopefully it is just the tree building that will be an issue... I am trying to run the very memory-intensive reconstruct-database step now, so if that works I can compare with previous results, since these samples have been processed before using the standard single-region method.

Best,
Kristian


Hi @Kristian_Holm,

Thanks for clarifying. Please keep us updated about how it works.

Best,
Justine