This is a draft Community Tutorial.
- The command qiime alignment sina may not yet be available in your installation of Qiime2
- Comments and suggestions would be greatly appreciated
Aligning sequences to match a reference alignment
Background
Traditional de novo methods mutually align a set of unaligned sequences to create a multiple sequence alignment (MSA) from scratch. Re-running these methods with additional sequences will create MSAs with varying numbers of columns and different assignments of bases to each column. These alignments are therefore incompatible with one another and cannot be joined through concatenation.
Reference-based alignment methods, on the other hand, are meant to add sequences to an already existing alignment. Alignments computed using reference-based alignment tools always have widths identical to the reference alignment and maintain the meaning of each column. Therefore, these alignments may be concatenated.
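The width property can be checked on any FASTA alignment with standard shell tools. The following is a minimal sketch using toy data, not output from SINA: the helper name alignment_widths and the file names are made up for illustration. It concatenates a small "reference" alignment with a "query" alignment of the same width and confirms that the combined file still has a single column count.

```shell
# Hypothetical helper: print each distinct sequence width found in a FASTA
# alignment (multi-line records are joined before measuring).
alignment_widths() {
  awk '
    /^>/ { if (seq != "") widths[length(seq)] = 1; seq = ""; next }
         { seq = seq $0 }
    END  { if (seq != "") widths[length(seq)] = 1
           for (w in widths) print w }
  ' "$1" | sort -n
}

workdir=$(mktemp -d)

# Toy reference alignment, 8 columns wide.
printf '>ref1\nAC-GT-AC\n>ref2\nACGGT-AC\n' > "$workdir/reference.fasta"
# Toy query aligned against that reference: same 8 columns.
printf '>query1\nAC-GTTAC\n' > "$workdir/queries.fasta"

# Because the widths match, the files can simply be concatenated.
cat "$workdir/reference.fasta" "$workdir/queries.fasta" > "$workdir/combined.fasta"

combined_widths=$(alignment_widths "$workdir/combined.fasta")
echo "distinct widths in combined alignment: $combined_widths"
```

If the helper prints more than one width, the inputs were not aligned against the same reference and should not be concatenated.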
The three main rRNA reference database projects (Greengenes, RDP and SILVA) each use a reference-based alignment tool (PyNAST, Infernal and SINA, respectively) to compute their very large reference alignments. They do this because reference-based alignment scales very well, allowing massive alignments to be built, and because it allows for continuity and curation of the alignment. The extended alignments can then be used to extend existing trees, allowing continuity in the taxonomic curation of the reference phylogenies.
In the context of Qiime2, you may want to choose a reference based alignment method
- if you have an existing alignment and tree and want to add sequences
- if you want to make use of a curated reference alignment
Aligning sequences to match a reference alignment
In addition to alignments imported into Qiime2 as usual, the sina subcommand of qiime alignment can make use of reference alignments in arb format directly. This is recommended for large reference alignments only, typically when using a SILVA database. While using arb files directly saves significant time for large reference databases, it currently breaks provenance tracking.
Let's start with a simpler, if contrived, example:
A) Using a reference alignment in "qza" format
Suppose we need the representative sequences used in the otu-clustering tutorial aligned to match the alignment computed in the moving pictures tutorial. We can proceed with the qza files we already have as follows:
Input: reference alignment:
aligned-rep-seqs.qza: view | download (MAFFT-aligned sequences from the moving pictures tutorial)
Input: unaligned representative sequences:
rep-seqs-dn-99.qza: view | download (DADA2 representative sequences from the otu-clustering tutorial)
Run SINA:
qiime alignment sina \
--i-sequences rep-seqs-dn-99.qza \
--i-reference aligned-rep-seqs.qza \
--o-alignment rep-seqs-dn-99-aligned.qza
Output: aligned representative sequences:
The output file contains aligned sequences, just like the reference. It can now be passed to methods from q2-phylogeny or other modules requiring aligned sequence data.
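To look at the result outside of Qiime2, the artifact can be inspected and exported with the standard qiime tools commands. This is a sketch under a few assumptions: the directory name exported-alignment is arbitrary, the --input-path/--output-path option names apply to recent Qiime2 releases (older releases used a different export syntax), and the guard simply skips the commands when qiime or the artifact is not present.

```shell
artifact=rep-seqs-dn-99-aligned.qza   # output of the sina command above
export_dir=exported-alignment         # hypothetical directory name

if command -v qiime >/dev/null 2>&1 && [ -f "$artifact" ]; then
  # Show the artifact's UUID, semantic type and data format.
  qiime tools peek "$artifact"
  # Unpack the aligned sequences into a plain directory.
  qiime tools export --input-path "$artifact" --output-path "$export_dir"
  ls "$export_dir"
else
  echo "skipping: qiime or $artifact not found" >&2
fi
```

The exported directory should contain the alignment as a FASTA file, which can then be opened in any alignment viewer.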
B) Using a reference alignment in "arb" format
The current SILVA RefNR SSU database is a little large for a tutorial. Instead, let's use the most recent version of the Living Tree Project (publication) reference database, which is also available as an arb file, as the reference.
Input: reference alignment:
LTPs132_SSU.arb: download
As input sequences we will be using the same data as in example A.
Run SINA:
qiime alignment sina \
--i-sequences rep-seqs-dn-99.qza \
--p-arb-reference LTPs132_SSU.arb \
--o-alignment rep-seqs-dn-99-aligned.qza
Note that this command will take a little longer the first time it is run, as it generates a set of index files used by the ARB PT server. Using the LTP SSU dataset, which comprises only the 13,903 type strain 16S sequences, the indexing process should take less than a minute, but for reference databases comprising hundreds of thousands of sequences, significant amounts of memory and time are required.
Output: aligned representative sequences:
Moving on
You can now proceed with the aligned sequences as described in the q2-phylogeny community tutorial, starting at the alignment masking step.
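As a rough sketch of that next step, the masking command from the alignment plugin can be pointed at the file produced above. The output file name is an arbitrary choice made here, and the guard only skips the command when qiime or the input artifact is missing; see the q2-phylogeny tutorial for the recommended parameters and the tree-building steps that follow.

```shell
aligned=rep-seqs-dn-99-aligned.qza          # output of the sina command above
masked=masked-rep-seqs-dn-99-aligned.qza    # hypothetical output name

if command -v qiime >/dev/null 2>&1 && [ -f "$aligned" ]; then
  # Remove highly variable or gap-dominated alignment columns before
  # building a tree.
  qiime alignment mask \
    --i-alignment "$aligned" \
    --o-masked-alignment "$masked"
else
  echo "skipping: qiime or $aligned not found" >&2
fi
```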