This is a draft Community Tutorial for the q2-phylogeny plugin. This content refers to the
Comments and suggestions would be greatly appreciated. Keep an out for any updates here. Until then, enjoy !
Inferring phylogenies via q2-phylogeny.
Several downstream diversity metrics, available within QIIME, require that a
phylogenetic tree be constructed using the OTUs (Operational Taxonomic Units) or
Exact Sequence Variants (ESVs) being investigated.
But how do we proceed to construct a phylogeny from our sequence data?
Well, there are two phylogeny-based approaches we can use. Deciding upon which to use is largely dependent on your study questions:
- A reference-based fragment insertion approach. Which, is likely the ideal choice. Especially, if your reference phylogeny (and associated representative sequences) encompass neighboring relatives of which your sequences can be reliably inserted. Any sequences that do not match well enough to the reference are not inserted. This may or may not have implications for your study questions. For more information about fragment insertion, check out the great examples here.
- A de novo approach. Marker genes that can be globally aligned across divergent taxa, are usually amenable to sequence alignment and phylogenetic investigation through this approach. This community tutorial will focus on the de novo approaches.
Here, you will learn how to make use of de novo phylogenetic approaches to:
1) generate a sequence alignment within QIIME
2) mask the alignment if needed
3) construct a phylogenetic tree
4) root the phylogenetic tree
Prior to constructing a phylogeny we must generate a multiple sequence alignment (MSA).
When constructing a MSA we are making a statement about the putative homology of the aligned residues (columns of the MSA) by virtue of their sequence similarity.
The number of algorithms to construct a MSA are legion. We will make use of MAFFT (Multiple Alignment using Fast Fourier Transform) via the q2-alignment plugin. The paper for version 7 of MAFFT is here.
Input: unaligned representative sequences from the Atacama soil microbiome tutorial:
qiime alignment mafft --i-sequences rep-seqs.qza --o-alignment aligned-rep-seqs.qza
Reducing alignment ambiguity: masking and reference alignments.
Why mask an alignment?
Masking helps to eliminate alignment columns that are phylogenetically uninformative or misleading before phylogenetic analysis. Much of the time alignment errors can introduce noise and confound phylogenetic inference. It is common practice to mask (remove) these ambiguously aligned regions prior to performing phylogenetic inference. In particular, David Lane’s (1991) chapter “16S/23S rRNA sequencing” proposed masking SSU data prior to phylogenetic analysis. However, knowing how to deal with ambiguously aligned regions and when to apply masks largely depends on the marker genes being analyzed and the question being asked of the data.
Keep in mind that this is still an active area of discussion, as highlighted by the following non-exhaustive list of articles: Wu et al. 2012, Ashkenazy et al. 2018, Schloss 2010, Tan et al. 2015, Rajan 2015.
How to mask alignment.
For our purposes, we’ll assume that we have ambiguously aligned columns in the MAFFT alignment we produced above. The default settings for the
--p-min-conservation of the alignment mask plugin approximates the Lane mask filtering of QIIME 1. Keep an eye out for updates to the alignment plugin.
qiime alignment mask --i-alignment aligned-rep-seqs.qza --o-masked-alignment masked-aligned-rep-seqs.qza
Reference based alignments
There are a variety of tools such as PyNAST (using NAST), Infernal, and SINA, etc., that attempt to reduce the amount of ambiguously aligned regions by using curated reference alignments (e.g. SILVA). Reference alignments are particularly powerful for rRNA gene sequence data, as knowledge of secondary structure is incorporated into the curation process, thus increasing alignment quality. For a more in-depth and eloquent overview of reference-based alignment approaches, check out the great SINA community tutorial.
Note: Alignments constructed using reference based alignment approaches can be masked too, just like the above MAFFT example. Also, the reference alignment approach we are discussing here is distinct from the reference phylogeny approach (i.e. fragment insertion) we mentioned earlier. That is, we are not inserting our data into an existing tree, but simply trying to create a more robust alignment for making a better de novo phylogeny.
Construct a phylogeny
As with MSA algorithms, phylogenetic inference tools are also legion. Fortunately, there are many great resources to learn about phylogentics. Below are just a few introductory resources to get you started:
- Phylogeny for the faint of heart: a tutorial
- Molecular phylogenetics: principles and practice
- Phylogenetics: An Introduction
q2-phylogeny plugin of , there are several methods for phylogenetic inference based on the following tools:
and this pipeline:
FastTree is able to construct phylogenies from large sequence alignments quite rapidly. It does this by using the using a CAT-like rate category approximation, which is also available through RAxML (discussed below).
qiime phylogeny fasttree --i-alignment masked-aligned-rep-seqs.qza --o-tree fasttree-tree.qza --verbose
Run time: ~ 2 seconds
iTOL viewing: The tree (
qza) file we just generated, as well as others that we’ll produce throughout this tutorial, can be uploaded to iTOL and be interactively viewed. Even better, while viewing the tree topology in
Normal mode, you can drag and drop the alignment (
qza) file (the one you used to build the phylogeny) onto the tree visualization. This will create a
sequence alignment data set were you can directly view the sequence alignment alongside the phylogeny.
raxml will perform a single phylogentic inference and return a tree. Note, the default model for
--p-substitution-model GTRGAMMA. If you’d like to construct a tree using the CAT model like
fasttree, simply replace
GTRCAT as shown below:.
qiime phylogeny raxml --p-substitution-model GTRCAT --i-alignment masked-aligned-rep-seqs.qza --o-tree raxml-cat-tree.qza
Run time: ~2 minutes.
Perform multiple searches using
If you’d like to perform a more thorough search of “tree space” you can instruct
raxml to perform multiple independent searches on the full alignment by using
--p-n-searches 5. Once these 5 independent searches are completed, only the single best scoring tree will be returned. Note, we are not bootstrapping here, we’ll do that in a later example. Let’s set
--p-substitution-model GTRCAT. Finally, let’s also manually set a seed via
--p-seed. By setting our seed, we allow other users the ability to reproduce our phylogeny. That is, anyone using the same sequence alignment and substitution model, will generate the same tree as long as they set the same seed value. Although,
--p-seed is not a required argument, it is generally a good idea to set this value.
qiime phylogeny raxml --p-substitution-model GTRCAT --p-seed 1723 --p-n-searches 5 --i-alignment masked-aligned-rep-seqs.qza --o-tree raxml-cat-searches-tree.qza --verbose
Run time: ~6 minutes.
In phylogenetics, it is good practice to check how well the splits / bipartitions in your phylogeny are supported. Often one is interested in which clades are robustly separated from other clades in the phylogeny. One way, of doing this is via bootstrapping (See the Bootstrapping section of the first introductory link above). In QIIME 2, we’ve provided access to the RAxML rapid bootsrap feature. The only difference between this command and the previous are the additional flags
--p-rapid-bootstrap-seed. It is quite common to perform anywhere from 100 - 1000 bootstrap replicates. The
--p-rapid-bootstrap-seed works very much like the
--p-seed argument from above except that it allows anyone to reproduce the bootstrapping process and the associated supports for your splits.
- Bootstrap the input alignment 100 times and perform a Maximum Likelihood (ML) search on each.
- Find best scoring ML tree through multiple independent searches using the original input alignment. The number of independent searches is determined by the number of bootstrap replicates set in the 1st step. That is, your search becomes more thorough with increasing bootstrap replicates. The ML optimization of RAxML uses every 5th bootstrap tree as the starting tree for an ML search on the original alignment.
- Map the bipartitions (bootstrap supports, 1st step) onto the best scoring ML tree (2nd step).
qiime phylogeny raxml-rapid-bootstrap --p-seed 1723 --p-rapid-bootstrap-seed 9384 --p-bootstrap-replicates 100 --p-substitution-model GTRCAT --i-alignment masked-aligned-rep-seqs.qza --o-tree raxml-cat-bootstrap-tree.qza --verbose
Run time: ~18 minutes.
RAxML Run Time
You may gave noticed that we’ve added the flag
--p-raxml-version to both RAxML methods. Here, we are providing a means to simply access versions of RAxML that have optimized vector instructions for various modern x86 processor architectures. Paraphrased from the RAxML manual and help documentation:
- Most recent processors (last ~4 years) will support SSE3 vector instructions. Newer processors (i.e. within the last ~2 years will likely support the faster AVX2 vector instructions.
- These instructions will substantially accelerate the likelihood and parsimony computations. SSE3 versions will run approximately 40% faster than the standard version. The AVX2 version will run 10-30% faster than the SSE3 version.
Additionally, for larger sequence alignments, you can:
- Make use of multiple cores / threads as outlined earlier. Keep in mind that using more cores / threads is not necessarily always better. This this thread in the RAxML forum. Additionally, the RAxML manual suggests 1 core per ~500 DNA alignment patterns. This is usually visible on screen, when the
--verboseoption is used.
- Try using a rate category (CAT model; via
--p-substitution-model), which results in equally good trees as the GAMMA models and is approximately 4 times faster. See the paper here. The CAT approximation is also Ideal for alignments containing 10,000 or more taxa, and is very much similar the CAT-like model of FastTree2.
Similar to the
raxml-rapid-bootstrap methods above, we provide similar functionality for IQ-TREE:
iqtree-ultrafast-bootstrap. IQ-TREE is unique compared to the
raxml options, in that it provides access to 286 models of nucleotide substitution! See here for more details on nucleotide models. IQ-TREE can also determine which of these models best fits your dataset prior to constructing your tree via its built-in ModelFinder algorithm. This is the default in QIIME 2, but do not worry, you can set any one of the 286 models of nucleotide substitution via the
--p-substitution-model flag, e.g. you can set the model as
HKY+I+G instead of the default
MFP. Keep in mind the additional computational time required for model testing via ModelFinder. You can read more about IQ-TREE here.
The simplest way to run the
iqtree command with default settings and automatic model selection (
MFP) is like so:
qiime phylogeny iqtree --i-alignment masked-aligned-rep-seqs.qza --o-tree iqt-tree.qza --verbose
Run time: ~10 minutes.
Specifying a substitution model
We can also set a substitution model of our choosing. You may have noticed while watching the onscreen output of the previous command that the best fitting model selected by ModelFinder is noted. For the sake of argument, let’s say the best selected model was shown as
F is only a notation to let us know that if a given model supports unequal base frequencies, then the empirical base frequencies will be used by default. Using empirical base frequencies (
F), rather than estimating them via likelihood, greatly reduces computational time. The
iqtree plugin will not accept
F within the model notation supplied at the command line, as this will always be implied automatically for the appropriate model. Also, the
iqtree plugin only accepts
G4 to be specified within the model notation. The
4 is simply another explicit notation to remind us that four rate categories are being assumed by default. The notation approach used by the plugin simply helps to retain simplicity and familiarity when supplying model notations on the command line. So, in brief, we only have to type
GTR+I+G as our input model:
qiime phylogeny iqtree --p-substitution-model 'GTR+I+G' --i-alignment masked-aligned-rep-seqs.qza --o-tree iqt-gtrig-tree.qza --verbose
Run time: ~3 minutes.
Let’s rerun the command above and add the
--p-fast option. This option, only compatible with the
iqtree method, resembles the fast search performed by
fasttree. Secondly, let’s also perform multiple tree searches and keep the best of those trees (as we did earlier with the
raxml --p-n-searches ... command):
qiime phylogeny iqtree --p-substitution-model 'GTR+I+G' --i-alignment masked-aligned-rep-seqs.qza --o-tree iqt-gtrig-fast-ms-tree.qza --p-fast --p-n-runs 10 --verbose
Run time: ~ 50 seconds.
Single branch tests
IQ-TREE provides access to a few single branch testing methods.
--p-alrt [INT >= 1000]
--p-abayes [TRUE | FALSE]
local bootstrap test via
--p-lbp [INT >= 1000]
Single branch tests are commonly used as an alternative to the bootstrapping approach we’ve discussed above, as they are substantially faster and often recommended when constructing large phylogenies (e.g. >10,000 taxa). All three of these methods can be applied simultaneously and viewed within iTOL as separate bootstrap support values. These values are always in listed in the following order of alrt / lbp / abayes. We’ll go ahead and apply all of the branch tests in our next command, while specifying the same substitution model as above. Feel free to combine this with the
qiime phylogeny iqtree --i-alignment masked-aligned-rep-seqs.qza --o-tree iqt-sbt-tree.qza --p-alrt 1000 --p-abayes --p-lbp 1000 --p-substitution-model 'GTR+I+G' --verbose
Run time: ~3 minutes.
IQ-TREE search settings
There are quite a few adjustable parameters available for
iqtree that can be modified improve searches through “tree space” and prevent the search algorithms from getting stuck in local optima. One particular “best practice” to aid in this regard, is to adjust the following parameters:
--p-stop-iter. In brief, the larger the value for NNI (nearest-neighbor interchange) perturbation, the larger the jumps in “tree space”. This value should be set high enough to allow the search algorithm to avoid being trapped in local optima, but not to high that the search is haphazardly jumping around “tree space”. That is, like Goldilocks and the three s you need to find a setting that is “just right”, or at least within a set of reasonable bounds. One way of assessing this, is to do a few short trial runs using the
--verbose flag. If you see that the likelihood values are jumping around to much, then lowering the value for
--p-perturb-nni-strength may be warranted. As for the stopping criteria, i.e.
--p-stop-iter, the higher this value, the more thorough your search in “tree space”. Be aware, increasing this value may also increase the run time. That is, the search will continue until it has sampled a number of trees, say 100 (default), without finding a better scoring tree. If a better tree is found, then the counter resets, and the search continues.
These two parameters deserve special consideration when a given data set contains many short sequences, quite common for microbiome survey data. We can modify our original command to include these extra parameters with the recommended modifications for short sequences, i.e. a lower value for perturbation strength (shorter reads do not contain as much phylogenetic information, thus we should limit how far we jump around in “tree space”) and a larger number of stop iterations. See the IQ-TREE command reference for more details about default parameter settings. Finally, we’ll let
iqtree run in fast mode, perform the model testing, and automatically determine the optimal number of CPU cores to use.
qiime phylogeny iqtree --i-alignment masked-aligned-rep-seqs.qza --o-tree iqt-nnisi-fast-tree.qza --p-perturb-nni-strength 0.2 --p-stop-iter 200 --p-n-cores 0 --p-fast --verbose
Run time: ~15 minutes using 2 cores. This includes model testing and tree search.
As per our discussion in the
raxml-rapid-bootstrap section above, we can also use IQ-TREE to evaluate how well our splits / bipartitions are supported within our phylogeny. This is done through the ultrafast bootstrap command.
To perform an ultrafast bootstrap analysis, we’ll use the defaults for automatic model selection (
1000 bootstrap replicates (minimum required), and use the same generally suggested parameters for constructing a phylogeny using short sequences. Again, we’ll let
iqtree-ultrafast-bootstrap automatically determine the optimal number of CPU cores to use:
qiime phylogeny iqtree-ultrafast-bootstrap --i-alignment masked-aligned-rep-seqs.qza --o-tree iqt-nnisi-bootstrap-tree.qza --p-perturb-nni-strength 0.2 --p-stop-iter 200 --p-n-cores 0 --verbose
Run time: ~30 minutes using 2 cores. This includes model testing and bootstrapping.
Perform single branch tests alongside ufboot
We can also apply single branch test methods concurrently with ultrafast bootstrapping. The support values will always be represented in the following order: alrt / lbp / abayes / ufboot. Again, these values can be seen as separately listed bootstrap values in iTOL. We’ll also specify a model as we did earlier.
qiime phylogeny iqtree-ultrafast-bootstrap --i-alignment masked-aligned-rep-seqs.qza --o-tree iqt-nnisi-bootstrap-sbt-gtrig-tree.qza --p-perturb-nni-strength 0.2 --p-stop-iter 200 --p-n-cores 0 --p-alrt 1000 --p-abayes --p-lbp 1000 --p-substitution-model 'GTR+I+G' --verbose
Run time: ~38 minutes using 2 cores. This includes bootstrapping and single branch testing.
Finally, if we’d like to reduce the impact of potential model violations that occur during our UFBoot search, and / or would simply like to be more rigorous, we can add the
--p-bnni option to any of the
iqtree-ultrafast-bootstrap commands above.
Root the phylogeny
In order to make proper use of diversity metrics such as UniFrac, the phylogeny must be rooted, see here for more information about rooting. In general, phylogenetic inference tools using Maximum Likelihood often return an unrooted tree by default.
QIIME 2 provides a way to mid-point root our phylogeny. We plan to add other rooting options in the future. For now, we’ll root our bootstrap tree from
iqtree-ultrafast-bootstrap like so:
qiime phylogeny midpoint-root --i-tree iqt-nnisi-bootstrap-sbt-gtrig-tree.qza --o-rooted-tree iqt-nnisi-bootstrap-sbt-gtrig-tree-rooted.qza
iTOL viewing Reminder: We can view our tree and its associated alignment via iTOL. All you need to do is upload the
iqt-nnisi-bootstrap-sbt-gtrig-tree-rooted.qza tree file. Display the tree in
Normal mode. Then drag and drop the
masked-aligned-rep-seqs.qza file onto the visualization. Now you can view the phylogeny alongside the alignment. Below is a link to an example screen-shot of the tree & sequence alignment visualization from iTOL:
Here we will outline the use of the phylogeny pipeline
One advantage of pipelines is that they combine ordered sets of commonly used commands, into one condensed simple command. To keep these “convenience” pipelines easy to use, it is quite common to only expose a few options to the user. That is, most of the commands executed via pipelines are often configured to use default option settings. However, options that are deemed important enough for the user to consider setting, are made available. The options exposed via a given pipeline will largely depend upon what it is doing. Pipelines are also a great way for new users to get started, as it helps to lay a foundation of good practices in setting up standard operating procedures.
Rather than run one or more of the following QIIME 2 commands listed below:
qiime alignment mafft ...
qiime alignment mask ...
qiime phylogeny fasttree ...
qiime phylogeny midpoint-root ...
We can make use of the pipeline
align-to-tree-mafft-fasttree to automate the above four steps in one go. Here is the description taken from the pipeline help doc:
This pipeline will start by creating a sequence alignment using MAFFT, after which any alignment columns that are phylogenetically uninformative or ambiguously aligned will be removed (masked). The resulting masked alignment will be used to infer a phylogenetic tree and then subsequently rooted at its midpoint. Output files from each step of the pipeline will be saved. This includes both the unmasked and masked MAFFT alignment from q2-alignment methods, and both the rooted and unrooted phylogenies from q2-phylogeny methods.
This can all be accomplished by simply running the following:
qiime phylogeny align-to-tree-mafft-fasttree --i-sequences rep-seqs.qza --output-dir mafft-fasttree-output
Or you can specify output for each file specifically via the flags
Congratulations! You now know how to construct a phylogeny in QIIME 2!