Why is there no principal component and redundancy analysis in QIIME2?

Rob_DNA · July 21, 2022, 10:03am

Hello,

PCoA (on e.g. a Bray-Curtis dissimilarity matrix or UniFrac distances) is widely used in QIIME2, but why are principal component analysis (PCA) and redundancy analysis (RDA) not implemented in QIIME2? These are also widely used multivariate analyses.

Thank you very much.

Regards,
Rob

jwdebelius · July 21, 2022, 2:02pm

Hi @Rob_DNA,

Before we get started, I'd like to offer the disclaimer that I've not used redundancy analysis; I did find a nice chapter on it in a statistical ecology book, so I apologize if I make some incorrect assumptions. Also, I have a cup of tea, and you may want a beverage of your choice, because this is getting long.

TL;WR:

PCA is based on Euclidean distance, which is inappropriate for microbiome data.
Compositional PCA fails to address sparsity.
QIIME 2 has DEICODE, a compositionally aware, robust technique that deals with the zeros in a clever way and gives you an ordination.
We don't currently have RDA, but in theory it could be implemented. If you wanted to do that, there's a lot of support for new plugin development.
Adonis could give you something like the loading ranking for the technique, although it's not perfect.

Okay, with that said... my tea is hot, and I'm ready to go meander through ordination.

I want to start with the difference between PCA and PCoA, for myself and for future readers.

Principal components analysis (PCA) is an ordination technique that accepts the data, performs a distance transformation - usually a Euclidean transform - and then creates an ordination/map. Because it knows the data that's gone in, it can also place feature landmarks in that space.

Principal coordinates analysis (PCoA) takes a distance matrix in. This means it's agnostic to the distance transform that's been performed on the data. You can use any distance metric you like to create this ordination; it makes fewer assumptions about the distribution of the data. However, because you're working on already transformed data, you can't go backwards and add in features.

It's worth noting that PCoA on a Euclidean distance is equivalent to PCA for a given transform, in terms of sample space.
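
The equivalence above can be checked numerically. This is a minimal NumPy sketch on made-up random data, not anything QIIME 2 runs internally: it computes PCA scores from the data and PCoA scores from the Euclidean distance matrix alone, then compares the sample coordinates (which are only defined up to a sign flip per axis).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # toy data: 6 samples x 4 features (made up)

# PCA: center each feature, then take the SVD; sample scores are U * S.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_scores = U * S

# PCoA: start from the pairwise Euclidean distance matrix only.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

# Gower double-centering of the squared distances, then eigendecomposition.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]  # largest eigenvalues first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2  # compare the first two axes
pcoa_scores = eigvecs[:, :k] * np.sqrt(eigvals[:k])

# Each axis can be flipped in sign, so compare absolute values.
same = np.allclose(np.abs(pca_scores[:, :k]), np.abs(pcoa_scores))
print(same)
```

Note that only the PCA route keeps `Vt`, the feature loadings; the PCoA route never sees the features, which is why it cannot place them in the ordination.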

PCA can better help you contextualize your data if you meet its assumptions.
PCoA is great because you can get ordination with fewer assumptions about your data.

Traditionally, and outside of specific transformations on the data, we've avoided PCA because it's based on Euclidean distance. Euclidean distance on untransformed microbiome data isn't great. It tends to poorly occupy the ordination space (you see spikes or stars around the axes). This figure, particularly panel E, from Hamady and Knight, 2009 is often cited as one of those reasons... particularly the spikiness and spacing of the Euclidean distance.

Because it's not a great approach, the qiime2 devs decided not to implement it. Sometimes, there are guard rails in place to discourage people from making less effective analysis decisions, and this is one of them.

The second issue with classic, Euclidean-based PCA is compositionality. Essentially, our microbiome data is constrained and adds up to 1. If you haven't read "Microbiome Datasets Are Compositional: And This Is Not Optional", I'd highly recommend it. Before you can do a PCA, you'd need to do a compositional transform.

Luckily... Aitchison distance is a compositional Euclidean distance! So, theoretically, you could do a compositional PCA based on CLR-transformed data. It was possible, it was a choice not to, but it was possible... (As a note, PhILR in R also offers a PCA based on an ILR transform; since PhILR isn't a plugin, this is obviously unavailable in QIIME 2, but totally an option in R.)
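
To make the CLR and Aitchison relationship concrete, here is a minimal NumPy sketch on a made-up table with no zeros. The helper name `clr` and the toy values are assumptions for illustration, not a QIIME 2 API.

```python
import numpy as np

def clr(counts):
    """Centered log-ratio transform of strictly positive rows."""
    logged = np.log(counts)
    return logged - logged.mean(axis=1, keepdims=True)

# Toy table: 3 samples x 4 features, no zeros (values are made up).
table = np.array([
    [10.0, 20.0, 30.0, 40.0],
    [20.0, 40.0, 60.0, 80.0],  # same composition as sample 0, twice the depth
    [40.0, 30.0, 20.0, 10.0],
])

clr_table = clr(table)

# Aitchison distance is the Euclidean distance between CLR-transformed samples.
diff = clr_table[:, None, :] - clr_table[None, :, :]
aitchison = np.sqrt((diff ** 2).sum(axis=-1))

# A compositional PCA is then an ordinary PCA (SVD) of the centered CLR table.
centered = clr_table - clr_table.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
sample_scores = U * S     # sample coordinates
feature_loadings = Vt.T   # feature directions, usable for a biplot
```

Samples 0 and 1 differ only in depth, so their Aitchison distance is zero here. That only holds because the table has no zeros, since `np.log` is undefined at zero.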

Unfortunately... Aitchison has a major problem that would show up in a PCA transform. Most microbiome environments are sparse. (There's a whole literature on this sparsity.) I primarily work in guts, and back of the envelope, I expect about 80% of my features to be present in less than 20% of my samples. That means a lot of zeros. We can, of course, do zero substitution - in QIIME 2, that's done by adding a pseudocount - other approaches will substitute in a very low value. A pseudocount ends up weighting your data more heavily toward sparse features, and so your ordination becomes a reflection of that sparsity.
You could also filter the data, but how to do that is, I think, a whole other issue I can't answer. I think you'd have to do a series of tests, pick the correct filtering approach, and...

A second piece of this transform is that, because of the CLR and zero substitution, Aitchison distance (and Aitchison PCA) are depth sensitive.
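
A small sketch of that depth sensitivity, assuming a +1 pseudocount and made-up counts: the same proportions at two sequencing depths end up far apart once zeros are substituted, while a zero-free community at the same two depths does not. The helper names are hypothetical.

```python
import numpy as np

def clr(values):
    """Centered log-ratio transform of a strictly positive vector."""
    logged = np.log(values)
    return logged - logged.mean(axis=-1, keepdims=True)

def aitchison(a, b):
    """Euclidean distance between two CLR-transformed samples."""
    return float(np.linalg.norm(clr(a) - clr(b)))

# Hypothetical sparse community: identical proportions at two depths.
shallow = np.array([50.0, 30.0, 20.0, 0.0, 0.0])
deep = shallow * 100

# Zero substitution with a pseudocount makes the two depths look different.
pseudocount = 1.0
dist_pseudo = aitchison(shallow + pseudocount, deep + pseudocount)

# Without zeros, no substitution is needed and depth drops out.
dense_shallow = np.array([50.0, 30.0, 20.0, 5.0, 5.0])
dist_dense = aitchison(dense_shallow, dense_shallow * 100)

print(dist_pseudo, dist_dense)
```

The zero entries stay at the pseudocount in both samples while everything else scales with depth, so the log-ratios between observed and unobserved features change with depth alone.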

DEICODE solves both of these issues and has the advantage of being a PCA implemented for QIIME 2. Basically, DEICODE takes your data, performs a partial CLR transform, and then uses mathematical magic (matrix completion) to solve an ordination while dealing with the zeros. The underlying math is beyond me, but when I tried to get it through my linear algebra-less brain, I found the GitHub tutorials super helpful.

The coordinates generated are a nice biplot with feature and sample loadings, and both Emperor and Qurro are set up to accept these and let you do pretty visualizations.
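
For intuition on the "partial CLR" step, here is a rough NumPy sketch of a robust CLR that only transforms observed (nonzero) entries and leaves zeros as missing values. This is an illustration under that assumption, not DEICODE's actual code; the function name `robust_clr` is hypothetical, and the step that solves the ordination around the missing entries is not shown.

```python
import numpy as np

def robust_clr(counts):
    """Log-transform only the nonzero entries and center each sample on the
    mean log of its own nonzero entries. Zeros are returned as NaN."""
    counts = np.asarray(counts, dtype=float)
    observed = counts > 0
    logged = np.full(counts.shape, np.nan)
    logged[observed] = np.log(counts[observed])
    row_means = np.nanmean(logged, axis=1, keepdims=True)
    return logged - row_means

# Toy sparse table: 3 samples x 4 features (values are made up).
table = np.array([
    [10, 0, 5, 0],
    [0, 8, 0, 2],
    [3, 3, 0, 0],
])

rclr_table = robust_clr(table)
print(rclr_table)
```

No pseudocount is added anywhere: the zeros are treated as unobserved rather than as very small values, which is what lets the downstream step avoid the weighting problem described earlier.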

What about RDA?

In theory, RDA could be added to QIIME 2 as a plugin. AFAIK, it doesn't exist now, but it could in theory. I don't know how that would play with Qurro, Emperor, etc. That would depend on the developers/maintainers of those packages, and I'm primarily a user who occasionally requests weird features.

However, there is a way to get that pseudo feature ranking. Adonis will provide an effect size ranking for covariates of interest with both continuous and categorical data. You can even adjust for other factors (as long as you set up the equation correctly). The current qiime2 adonis implementation does this process one at a time, so you'd have to harvest the data yourself to turn it into a visualization. You can see one example of that adonis work in panel B of figure 1 from He et al, showing their geographic effect is 5x any other factor.

It's not quite that nice factor-loading table, but it's a semi-convenient way to look at the data.
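
As a rough idea of what "harvesting" those effect sizes could look like, here is a simplified NumPy sketch that computes a PERMANOVA-style R² for one categorical factor at a time and ranks the factors. It is not the adonis implementation: it skips permutation p-values, continuous covariates, and adjustment for other factors, and the function names and toy metadata are hypothetical.

```python
import numpy as np

def permanova_r2(dist, groups):
    """Fraction of the total sum of squared distances explained by a grouping."""
    dist = np.asarray(dist, dtype=float)
    groups = np.asarray(groups)
    n = dist.shape[0]
    upper = np.triu_indices(n, k=1)
    ss_total = (dist[upper] ** 2).sum() / n
    ss_within = 0.0
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        sub = dist[np.ix_(idx, idx)]
        sub_upper = np.triu_indices(len(idx), k=1)
        ss_within += (sub[sub_upper] ** 2).sum() / len(idx)
    return 1.0 - ss_within / ss_total

def rank_factors(dist, metadata):
    """Score each factor on its own and sort by R^2, largest first."""
    scores = {name: permanova_r2(dist, labels) for name, labels in metadata.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy example: four samples on a line, so the distances are easy to check.
positions = np.array([0.0, 1.0, 10.0, 11.0])
dist = np.abs(positions[:, None] - positions[None, :])
metadata = {
    "site": ["A", "A", "B", "B"],    # separates the two clusters
    "batch": ["x", "y", "x", "y"],   # cuts across the clusters
}

ranked = rank_factors(dist, metadata)
print(ranked)
```

In this toy case "site" explains almost all of the variation and "batch" almost none, which is the kind of ranking you would then plot yourself.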

Hopefully this helps, and let us know if you've got any questions!

Best,
Justine

Rob_DNA · July 23, 2022, 6:18am

Hi Justine,

thank you very much for your elaborate answer!!

Indeed, I know the article by Gloor et al 2017, it is very interesting!

Few discussion points:

What's interesting in the article by Gloor is that the following is noted about Bray-Curtis/UniFrac/..: "they do not account for the compositional nature of the data.". So how appropriate is it then to use these techniques on microbiome data? I guess a lot of people here use these frequently. The authors also show a new "compositional approach", which indeed uses a transformation (e.g. CLR) and then does PCA. The R package mixOmics also applies this strategy (including a pseudocount); see "mixMC Preprocessing" on the mixOmics site.
You mention that RDA could be implemented in QIIME2, but don't the pitfalls you mention for PCA also apply to RDA? If I'm not mistaken, RDA is in fact a PCA on the fitted values of a multiple linear regression model, right? But perhaps that removes some of the limitations of PCA? It is interesting, as I found quite a few papers applying RDA to 16S sequencing data (often not mentioning whether they transformed the data or not, etc).
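
For reference, the "PCA on fitted values" description of RDA can be sketched in a few lines of NumPy. This is a bare-bones illustration on simulated data, not the vegan or any QIIME 2 implementation; the variable names are made up, and `response` stands in for a feature table that has already been transformed (e.g. CLR).

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 20

# Hypothetical inputs: 2 environmental variables and 5 transformed features.
env = rng.normal(size=(n_samples, 2))
response = env @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(n_samples, 5))

# Center both tables.
Y = response - response.mean(axis=0)
X = env - env.mean(axis=0)

# Step 1: multiple linear regression of every response column on X.
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
fitted = X @ coef

# Step 2: PCA (via SVD) of the fitted values gives the constrained axes.
U, S, Vt = np.linalg.svd(fitted, full_matrices=False)
site_scores = U * S

# Share of the total variance that the environmental variables explain.
constrained_fraction = (fitted ** 2).sum() / (Y ** 2).sum()

# At most rank(X) axes carry any variance; here that is 2.
n_axes = int((S > 1e-8 * S[0]).sum())
print(n_axes, constrained_fraction)
```

The regression works on whatever `Y` it is given, so any choices made in the transform (pseudocount, filtering, CLR) are inherited by the ordination rather than removed by it.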

Could you perhaps share some thoughts on these points? I started reading more about compositionality of sequencing data etc and its consequences on different statistical methods and I'm really intrigued..

I'm now struggling a bit to analyse sequencing data (MinION metabarcoding data, so not Illumina) incl. various environmental variables, and we are using an RDA on CLR-transformed data (including a +1 pseudocount).. at least the resulting data looks good (no horseshoe effect or other weird-looking artifacts) and makes sense, but I'm not sure how correct the analysis is. For my soil microbiome Illumina data I use QIIME2, and then I started wondering why QIIME2 doesn't use PCA/RDA ..

Thank you for your time!

jwdebelius · July 25, 2022, 1:32pm

Hi @Rob_DNA,

I think the key is something I may have articulated poorly:

DEICODE is a robust PCA technique. It solves the compositionality issues through a CLR transform, and the sparsity issue through a matrix transform. You could implement an RDA on top of this matrix; I don't know how it would integrate with other tools.

MixOmics doesn't solve the other issue I raised: sparsity. It is truly one of the pain points in microbiome data. Figuring out how to filter vs add a pseudocount is a fundamental problem we haven't solved. That's part of the reason I think DEICODE is better. I think an Aitchison PCA would be great for something less sparse - like RNAseq data. My experience has been that it's suboptimal with microbiome data.

On Bray-Curtis, UniFrac, and similar metrics, there are a few pieces. One is that they may be less appropriate from a compositional standpoint, but unlike classic Euclidean distance, they still offer valuable insight. I also think that in many microbiome systems, the phylogenetic aspect outweighs the issues around compositionality. For example, phylogeny can be really useful in addressing sparsity. (IIRC there's a phylogenetic rPCA in the works or published.)

Unweighted metrics (unweighted UniFrac; Jaccard) try to escape compositionality using a boolean transform - no relative abundance or log fold change, just present/absent. They can still be subject to some of the compositionality issues: a highly uneven community will have a large effect. You also have covariance with richness; but they're good metrics and give you information about the community.
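
The boolean transform is easy to see with Jaccard. This is a minimal sketch on made-up counts; the helper name is hypothetical, and unweighted UniFrac would additionally need a phylogenetic tree, which is not shown.

```python
import numpy as np

def jaccard_distance(a, b):
    """Jaccard distance on presence/absence: 1 - shared / union."""
    present_a = np.asarray(a) > 0
    present_b = np.asarray(b) > 0
    union = np.logical_or(present_a, present_b).sum()
    if union == 0:
        return 0.0  # two empty samples are treated as identical here
    shared = np.logical_and(present_a, present_b).sum()
    return 1.0 - shared / union

sample = np.array([100, 50, 0, 3, 0])
rescaled = sample * 10               # same community, 10x the depth
other = np.array([0, 50, 7, 0, 0])   # different membership

d_rescaled = jaccard_distance(sample, rescaled)
d_other = jaccard_distance(sample, other)
print(d_rescaled, d_other)
```

Abundances never enter the calculation, only which features are present. In real data, depth still matters indirectly, because deeper or more even samples detect more rare features, which ties the metric to richness.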

Weighted UniFrac is, IME, probably one of the most stable metrics out there. Again, it's not compositional, but it's often a benefit. It's also, again, sufficient to provide biological insight. Bray-Curtis sticks around because it's an old-school ecological metric and, for many people, the preferred hypothesis is that things only happen in abundant taxa.

I'd check your alpha diversity, with the +1 pseudocount and your sparsity, and consider filtering. But I don't work with soil, so I don't know the sparsity assumptions.

Best,
Justine

Rob_DNA · August 11, 2022, 6:16am

Hi Justine,

thank you very much for your elaborate answer! It is a really interesting topic.

system · September 11, 2022, 12:16pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.