Q2-alignment: reference based alignment using SINA

:construction: This is a draft Community Tutorial. :construction:

  • The command qiime alignment sina may not yet be available in your installation of Qiime2
  • Comments and suggestions would be greatly appreciated

Aligning sequences to match a reference alignment

Background

Traditional de novo methods mutually align a set of unaligned sequences to create a multiple sequence alignment (MSA) from scratch. Re-running these methods with additional sequences will create MSAs with varying numbers of columns and different assignments of bases to each column. These alignments are therefore incompatible with one another and cannot be joined through concatenation.

Reference based alignment methods, on the other hand, are meant to add sequences to an already existing alignment. Alignments computed using reference based alignment tools always have widths identical to the reference alignment and maintain the meaning of each column. Therefore, these alignments may be concatenated.
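To make the width argument concrete, here is a minimal shell sketch. The file names and sequences are hypothetical toy data, not part of the tutorial inputs, and the check assumes each sequence sits on a single line. It only illustrates why alignments that share a reference can be joined with a plain concatenation.

```shell
#!/bin/sh
# Toy data (hypothetical): a tiny "reference" alignment and a second
# alignment that was computed against it. '-' marks a gap.
cat > toy-reference.fasta <<'EOF'
>ref1
AC-GT-A
>ref2
ACGGT-A
EOF

cat > toy-added.fasta <<'EOF'
>query1
A--GTTA
EOF

# Count the distinct row widths across the given FASTA files.
# Assumes one line per sequence (no wrapped sequences).
count_widths() {
    awk '!/^>/ { print length($0) }' "$@" | sort -u | wc -l | tr -d ' '
}

n_widths=$(count_widths toy-reference.fasta toy-added.fasta)

if [ "$n_widths" -eq 1 ]; then
    # All rows share one width, so the columns line up and the files
    # can simply be concatenated.
    cat toy-reference.fasta toy-added.fasta > toy-combined.fasta
    echo "widths match: wrote toy-combined.fasta"
else
    echo "widths differ: do not concatenate" >&2
fi
```

Two de novo alignments of different sequence sets would generally fail this check, or pass it by coincidence while their columns still mean different things. Equal width is therefore necessary but only sufficient when both files were aligned against the same reference.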

The three main rRNA reference database projects (Greengenes, RDP and SILVA) each use a reference based alignment tool (PyNAST, Infernal and SINA, respectively) to compute their very large reference alignments. They do this because reference based alignment scales very well, allowing massive alignments to be built, and because it allows for continuity and curation of the alignment. The extended alignments can then be used to extend existing trees, allowing continuity in the taxonomic curation of the reference phylogenies.

In the context of Qiime2, you may want to choose a reference based alignment method

  • if you have an existing alignment and tree and want to add sequences
  • if you want to make use of a curated reference alignment

Aligning sequences to match a reference alignment

In addition to alignments imported into Qiime2 as usual, the sina subcommand of qiime alignment can make use of reference alignments in arb format directly. This is recommended for large reference alignments only, typically when using a SILVA database. While using arb files directly saves significant time for large reference databases, it currently breaks provenance tracking.

Let's start with a simpler, if contrived, example:

A) Using a reference alignment in "qza" format

Suppose we needed to have the representative sequences used in the otu-clustering tutorial aligned to match the alignment computed in the moving pictures tutorial. We could proceed with the qza files we already have as follows:

Input: reference alignment:

aligned-rep-seqs.qza :qiime2: view | download (MAFFT aligned sequences from moving pictures tutorial)

Input: unaligned representative sequences:

rep-seqs-dn-99.qza: :qiime2: view | download (Dada2 representative sequences from otu-clustering tutorial)

Run SINA:

qiime alignment sina \
   --i-sequences rep-seqs-dn-99.qza \
   --i-reference aligned-rep-seqs.qza \
   --o-alignment rep-seqs-dn-99-aligned.qza 

Output: aligned representative sequences:

rep-seqs-dn-99-aligned.qza: :qiime2: view | download

The output file contains aligned sequences, just like the reference. It can now be passed into methods from q2-phylogeny or other modules requiring aligned sequence data.

B) Using a reference alignment in "arb" format

The current SILVA RefNR SSU database is a little large for a tutorial. Instead, let's use the most recent version of the Living Tree Project (publication) reference database, which is also available as an arb file.

Input: reference alignment:

LTPs132_SSU.arb: download

As input sequences we will be using the same data as in example A.

Run SINA:

qiime alignment sina \
   --i-sequences rep-seqs-dn-99.qza \
   --p-arb-reference LTPs132_SSU.arb \
   --o-alignment rep-seqs-dn-99-aligned-ltp.qza

Note that this command will take a little longer the first time it is run, as it generates a set of index files used by the ARB PT server. With the LTP SSU dataset, which comprises only the 13,903 type strain 16S sequences, indexing should take less than a minute, but for reference databases comprising hundreds of thousands of sequences, significant amounts of memory and time are required.

Output: aligned representative sequences:

rep-seqs-dn-99-aligned-ltp.qza: :qiime2: view | download

Moving on

You can now proceed with the aligned sequences as described in the q2-phylogeny community tutorial, starting at the alignment masking step.

Should this post be removed, given that sina has been dropped from q2-alignment (deps: minpin skbio, drop sina (#72) · qiime2/q2-alignment@2bb78e9 · GitHub)?

Also, there is a reference to this post at Phylogenetic inference with q2-phylogeny (QIIME 2 2022.2.0 documentation):

There are a variety of tools, such as PyNAST (using NAST), Infernal, and SINA, that attempt to reduce the amount of ambiguously aligned regions by using curated reference alignments (e.g. SILVA). Reference alignments are particularly powerful for rRNA gene sequence data, as knowledge of secondary structure is incorporated into the curation process, thus increasing alignment quality. For a more in-depth and eloquent overview of reference-based alignment approaches, check out the great SINA community tutorial.

Thanks for bringing this up @nick-youngblut! This is correct, this method is no longer supported in q2-alignment. I am archiving this post now!