Robust Aitchison PCA Beta Diversity with DEICODE

cmartino · February 12, 2019, 4:15pm

DEICODE

(pronounced like decode /de.ko.de/)
Documentation is available in the plugin library.

DEICODE is a form of Aitchison distance that is robust to high levels of sparsity. DEICODE uses matrix completion, a natural solution to the zero problem that was originally formulated for recommendation systems. A simple way to interpret the method is as a robust compositional PCA (via SVD) in which zero values do not influence the resulting ordination. One of the benefits of using DEICODE is the ability to reveal salient inter-community niche feature importance in compositional biplots. These biplots can be easily visualized in the existing QIIME architecture through Emperor.

Installation within a QIIME Environment

If you have not already done so, activate your QIIME environment.

source activate qiime2-20xx.x

DEICODE is available for installation through pip or conda:

pip install deicode

Note: the conda install is only supported for QIIME >= 2019.1

conda install -c conda-forge deicode

Tutorial

Note: This guide assumes you have installed QIIME using one of the procedures in the install documents and have installed DEICODE.

Introduction

In this tutorial you will learn how to perform and interpret Robust Aitchison PCA through QIIME. The focus of this tutorial is compositional beta diversity. Many beta diversity metrics have been proposed, all with varying benefits on varying data structures. However, presence/absence metrics often prove to give better results than those that rely on abundances (e.g. unweighted vs. weighted UniFrac). One component of this phenomenon is that the interpretation of relative abundances can provide spurious results (see the differential abundance analysis introduction). One solution to this problem is to use a compositional distance metric such as Aitchison distance.

As a toy example, let’s build three taxa that represent common distributions we see in microbiome datasets. The first taxon increases exponentially across samples; this is a trend that we would be interested in. However, taxa 2 and 3 have much higher counts, and taxon 3 fluctuates randomly across samples.

In our distances below we have Euclidean, Bray-Curtis, Jaccard, and Aitchison distances (from left to right). We can see that the abundance-based metrics, Euclidean and Bray-Curtis, are heavily influenced by the abundance of taxon 3 and seem to fluctuate randomly. In the presence/absence metric, Jaccard, we see that the distance saturates to one very quickly. However, in the Aitchison distance we see a linear curve representing taxon 1. The distance is linear because Aitchison distance relies on log transforms (the log of the exponential trend of taxon 1 is linear).
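The comparison can be reproduced in miniature. The following is a minimal sketch using made-up counts (the values, the fixed "fluctuation" pattern standing in for random noise, and the helper names are all assumptions, not the data behind the original figure). It computes the distance of every sample from the first sample under Euclidean and Aitchison distance only.

```python
import math

def clr(counts):
    """Centered log-ratio transform of strictly positive counts."""
    logs = [math.log(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def aitchison(a, b):
    """Aitchison distance: Euclidean distance between CLR-transformed samples."""
    return euclidean(clr(a), clr(b))

# Hypothetical toy data: 10 samples x 3 taxa, no zeros.
# Taxon 1 doubles with each sample, taxon 2 is abundant and constant,
# taxon 3 is abundant and fluctuates (fixed pattern instead of random noise).
fluctuation = [900, 1100, 900, 1100, 1000, 900, 1100, 1000, 900, 1100]
samples = [[2 ** i, 1000, fluctuation[i]] for i in range(10)]

# Distance of every sample from the first sample.
euclid_from_first = [euclidean(samples[0], s) for s in samples]
aitchison_from_first = [aitchison(samples[0], s) for s in samples]

for i, (e, a) in enumerate(zip(euclid_from_first, aitchison_from_first)):
    print(f"sample {i}: euclidean={e:8.1f}  aitchison={a:5.2f}")
```

With these values the Aitchison column grows steadily with the sample index, following taxon 1, while the Euclidean column jumps up and down with taxon 3.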

From this toy example, it is clear that Aitchison distance better accounts for the proportions. However, we made the unrealistic assumption in our toy example that there were no zero counts. Real microbiome datasets contain a large number of zeros (i.e. they are sparse). Sparsity complicates log-ratio transformations because the logarithm of zero is undefined. To work around this, pseudocounts are often used, but they can skew results (see Naught all zeros in sequence count data are the same).

Robust Aitchison PCA solves this problem in two steps:

1. Compositional preprocessing using the centered log-ratio transform on only the non-zero values of the data (no pseudocount).

2. Dimensionality reduction through Robust PCA on only the non-zero values of the data (matrix completion).
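Step 1 can be sketched for a single sample as follows. This is an illustration only, not DEICODE's actual code: the function name `rclr` and the use of `None` to mark unobserved entries are assumptions made for the example. Step 2 (matrix completion) is not shown.

```python
import math

def rclr(counts):
    """CLR computed from the non-zero entries only.

    Zeros are not replaced by a pseudocount; they are returned as None
    (treated as missing) so they cannot influence later steps.
    """
    logs = [math.log(c) for c in counts if c > 0]
    if not logs:
        raise ValueError("sample has no non-zero counts")
    mean_log = sum(logs) / len(logs)  # log of the geometric mean of observed values
    return [math.log(c) - mean_log if c > 0 else None for c in counts]

sample = [10, 0, 1000, 0, 100]  # hypothetical counts for five features
transformed = rclr(sample)
print(transformed)
```

The observed entries are centered around zero, and the two zero counts stay missing rather than being turned into large negative log values.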

To demonstrate this in action we will run an example dataset below, where the output can be viewed as a compositional biplot through Emperor.

Example

In this example we will use Robust Aitchison PCA via DEICODE on the dataset from the “Moving Pictures” tutorial. If you have not yet completed the tutorial, it can be found here. The dataset consists of human microbiome samples from two individuals at four body sites at five timepoints, the first of which immediately followed antibiotic usage (Caporaso et al. 2011). If you have completed this tutorial, run the following command and skip the download section.

cd qiime2-moving-pictures-tutorial

If you have skipped the tutorial but would like to get started quickly, the data files needed for the DEICODE tutorial can be downloaded below.

mkdir qiime2-moving-pictures-tutorial
cd qiime2-moving-pictures-tutorial

Table view | download

save as: table.qza

Sample Metadata download

save as: sample-metadata.tsv

Feature Metadata view | download

save as: taxonomy.qza

Using table.qza, a raw count table of type FeatureTable[Frequency], we will generate our beta diversity ordination file. There are a few DEICODE parameters that we may want to consider. The first are the filtering cutoffs, --p-min-feature-count and --p-min-sample-count. Both of these parameters accept integer values and remove features or samples, respectively, with sums below the cutoff. The feature cutoff is useful when features with very low total counts across all samples represent contamination or chimeric sequences. The sample cutoff is useful when some samples received very few reads relative to other samples.
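To make the effect of the two cutoffs concrete, here is a minimal sketch on a tiny made-up table. The helper `filter_counts`, the table layout, and the order of the two steps (features first, then samples) are assumptions for illustration and are not taken from DEICODE's source.

```python
def filter_counts(table, min_feature_count, min_sample_count):
    """Drop features, then samples, whose total count is below the cutoff.

    `table` maps feature ID -> {sample ID: count}.
    Hypothetical helper; the order of the two steps is an assumption.
    """
    # Keep features whose sum over all samples reaches the feature cutoff.
    kept = {f: row for f, row in table.items()
            if sum(row.values()) >= min_feature_count}
    # Recompute sample totals on the remaining features only.
    sample_ids = {s for row in kept.values() for s in row}
    totals = {s: sum(row.get(s, 0) for row in kept.values()) for s in sample_ids}
    keep_samples = {s for s, total in totals.items() if total >= min_sample_count}
    return {f: {s: c for s, c in row.items() if s in keep_samples}
            for f, row in kept.items()}

table = {
    "feature_a": {"s1": 400, "s2": 300, "s3": 2},
    "feature_b": {"s1": 200, "s2": 350, "s3": 1},
    "feature_c": {"s1": 3, "s2": 0, "s3": 4},  # total of 7, below a cutoff of 10
}
filtered = filter_counts(table, min_feature_count=10, min_sample_count=500)
print(filtered)
```

Here feature_c is removed for having too few total counts, and sample s3 is removed for having too few reads.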

Note: it is not recommended to bin your features by taxonomic assignment (e.g. at the genus level).
Note: it is not recommended to rarefy your data before using DEICODE.

The other two parameters are --p-rank and --p-iterations. These parameters should rarely need to change from their defaults. The minimum value of --p-rank is 1 and the maximum recommended value is 10. Similarly, the minimum value of --p-iterations is 1, and it is recommended to keep it below 500.

Now that we understand the acceptable parameters, we are ready to run DEICODE.

 qiime dev refresh-cache

 qiime deicode rpca \
    --i-table table.qza \
    --p-min-feature-count 10 \
    --p-min-sample-count 500 \
    --o-biplot ordination.qza \
    --o-distance-matrix distance.qza

Output:

ordination.qza download
distance.qza download

Now that we have our ordination file, with type (PCoAResults % Properties(['biplot'])), we are ready to visualize the results. This can be done using the Emperor biplot functionality. In this case we will include metadata for our features (optional) and our samples (required).

qiime emperor biplot \
    --i-biplot ordination.qza \
    --m-sample-metadata-file sample-metadata.tsv \
    --m-feature-metadata-file taxonomy.qza \
    --o-visualization biplot.qzv \
    --p-number-of-features 8

Output:

biplot.qzv download

Biplots are exploratory visualization tools that allow us to represent the features (i.e. taxonomy or OTUs) that strongly influence the principal component axes as arrows. The interpretation of the compositional biplot differs slightly from classical biplot interpretation (we can view the qzv file at view.qiime2). The important features with regard to sample clusters are represented not by a single arrow but by the log ratio between features whose arrows point in different directions. A visualization tool for these log ratios is coming soon to QIIME.
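As a rough illustration of what such a log ratio looks like, the sketch below computes it per sample for two features. The sample names, feature names, counts, and the helper `log_ratio` are all hypothetical; samples where either feature has a zero count are left undefined (None).

```python
import math

def log_ratio(sample_counts, numerator, denominator):
    """Natural log of numerator/denominator counts per sample; None if either is zero."""
    ratios = {}
    for sample_id, counts in sample_counts.items():
        num = counts.get(numerator, 0)
        den = counts.get(denominator, 0)
        ratios[sample_id] = math.log(num / den) if num > 0 and den > 0 else None
    return ratios

# Hypothetical counts: taxon_A and taxon_B stand for two features whose
# arrows point in opposite directions in the biplot.
sample_counts = {
    "gut_1": {"taxon_A": 800, "taxon_B": 10},
    "gut_2": {"taxon_A": 400, "taxon_B": 5},   # half the depth of gut_1
    "skin_1": {"taxon_A": 20, "taxon_B": 640},
    "skin_2": {"taxon_A": 0, "taxon_B": 300},  # zero count: ratio undefined
}
ratios = log_ratio(sample_counts, "taxon_A", "taxon_B")
print(ratios)
```

The two gut samples get the same log ratio even though one has half the total counts, which is why log ratios are a convenient way to compare groups of samples.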

From this visualization we notice that BodySite seems to explain the clusters well. We can run PERMANOVA on the distances to test whether this grouping is statistically significant.

 qiime diversity beta-group-significance \
    --i-distance-matrix distance.qza \
    --m-metadata-file sample-metadata.tsv \
    --m-metadata-column BodySite \
    --p-method permanova \
    --o-visualization BodySite_significance.qzv

Output:

BodySite_significance.qzv download

Indeed, by viewing BodySite_significance.qzv at view.qiime2, we can see that the clusters observed in the biplot are significant.

Citation

C. Martino et al., A Novel Sparse Compositional Technique Reveals Microbial Perturbations. mSystems. 4 (2019), doi:10.1128/mSystems.00016-19.

Other Resources

Documentation on GitHub
The code for OptSpace was translated to Python from a MATLAB package maintained by Sewoong Oh (UIUC).

cxf514 · October 3, 2019, 3:29pm

Thanks for the development of DEICODE, it is really powerful.
I would like to know when the visualization tool you mentioned will be available.

yanxianl · October 3, 2019, 6:48pm

It's already available in the QIIME2 library. Check it out here.

fedarko · October 4, 2019, 2:07am

Hi @cxf514! As @yanxianl mentioned, the visualization tool mentioned here (Qurro) is already available for QIIME 2.

We're still working on improving Qurro's documentation and functionality (...and making a QIIME 2 forum post about it), but feel free to try it out alongside your DEICODE results! Feel free to make a new post on the forum if you have any questions about the tool.

cxf514 · October 4, 2019, 2:32am

you are a hero(^0^)/

kam · October 22, 2022, 10:00am

Thanks for this great tool!

@cmartino you wrote in the publication: "This method could possibly be adapted to or combined with other omics paradigms (e.g., metabolomics, metatranscriptomics, and metagenomics)".

Did you mean that robust compositional PCA is appropriate to use with other omics data (e.g. gene counts)? Have you compared this method to other distance methods in contexts other than taxa abundance?

cmartino · October 27, 2022, 8:43pm

Hi @kam,

We don't have any formal benchmarks currently published outside of taxonomic profiles obtained from amplicon or shotgun data. However, many (including myself) have used it with success on gene profiles and even metabolomics/proteomics data.

kam · January 11, 2023, 3:21pm

Once again, thanks for this extremely nice plugin. I have another question:

The distance matrix created contains values greater than 1, e.g. 3 or 4. How can I interpret them, given that in most beta diversity metrics the value is between 0 and 1?

cmartino · January 11, 2023, 3:44pm

Thanks for using the tool. Not all distances are bounded between 0 and 1; RPCA outputs an Aitchison distance (Euclidean distance on centered log-ratio transformed data). You can proceed how you normally would (e.g. PERMANOVA, between/within distance plots). You can read here and here for more about Aitchison distances.
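For reference, here is the standard definition for two fully observed compositions x and y with D parts, followed by a small worked example. The example values are made up, and this covers only the zero-free case, not the matrix-completion step that RPCA uses for sparse data.

```latex
\operatorname{clr}(x)_i = \ln x_i - \frac{1}{D}\sum_{j=1}^{D} \ln x_j ,
\qquad
d_A(x, y) = \sqrt{\sum_{i=1}^{D} \bigl(\operatorname{clr}(x)_i - \operatorname{clr}(y)_i\bigr)^2}

% Worked example with D = 2, x = (1, 100), y = (100, 1):
% clr(x) = (-\ln 10, \ln 10), \quad clr(y) = (\ln 10, -\ln 10)
d_A(x, y) = \sqrt{2\,(2\ln 10)^2} = 2\sqrt{2}\,\ln 10 \approx 6.51
```

Because the distance grows with the size of the log fold changes between samples, values such as 3 or 4 are expected and simply indicate larger compositional differences.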