Hi @Rob_DNA,
Before we get started, I'd like to offer the disclaimer that I've not used redundancy analysis (RDA). I did find a nice chapter on it in a statistical ecology book, but I apologize if I make some incorrect assumptions. Also, I have a cup of tea, and you may want a beverage of your choice, because this is getting long.
TL;WR:
- PCA is based on Euclidean distance, which is inappropriate for microbiome data
- Compositional PCA fails to address sparsity
- QIIME 2 has DEICODE, which is a compositionally aware, robust technique that deals with the zeros in a clever way and gives you an ordination
- We don't currently have RDA, but in theory it could be implemented. If you wanted to do that, there's a lot of support for new plugin development.
- Adonis could give you something like the loading ranking for the technique, although it's not perfect.
Okay, with that said... my tea is hot, and I'm ready to go meander through ordination.
I want to start with the difference between PCA and PCoA, for myself and for future readers.
Principal components analysis (PCA) is an ordination technique that accepts the data, performs a distance transformation - usually a Euclidean transform - and then creates an ordination/map. Because it knows the data that's gone in, it can also place feature landmarks in that space.
Principal coordinates analysis (PCoA) takes a distance matrix in. This means it's agnostic to the distance transform that's been performed on the data. You can use any distance metric you like to create this ordination; it makes fewer assumptions about the distribution of the data. However, because you're working on already transformed data, you can't go backwards and add in features.
It's worth noting that PCoA on a Euclidean distance is equivalent to PCA for a given transform, in terms of sample space.
PCA can better help you contextualize your data if you meet its assumptions.
PCoA is great because you can get ordination with fewer assumptions about your data.
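If it helps to see that equivalence rather than take my word for it, here's a minimal NumPy sketch. The data are made up, and the PCA and PCoA steps are written out by hand (SVD and Gower double-centering) rather than taken from any QIIME 2 plugin, so treat it as an illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))            # made-up table: 6 samples x 4 features

# PCA: center the features, then read sample scores off the SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_scores = U * s                     # sample coordinates
loadings = Vt.T                        # feature directions, which only PCA can give you

# PCoA: start from nothing but the Euclidean distance matrix.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J            # Gower double-centering
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1][: X.shape[1]]
pcoa_scores = eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))

# Axes can flip sign between the two methods, so compare absolute values.
same = np.allclose(np.abs(pca_scores), np.abs(pcoa_scores), atol=1e-8)
print(same)
```

The sample coordinates agree up to axis flips, but only the PCA route leaves you with `loadings`, which is the "feature landmarks" part.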
Traditionally, and outside of specific transformations on the data, we've avoided PCA because it's based on Euclidean distance. Euclidean distance on untransformed microbiome data isn't appropriate: it tends to occupy the ordination space poorly (you see spikes or stars around the axes). This figure, particularly panel E, from Hamady and Knight, 2009 is often cited as one of those reasons, in particular the spikiness and spacing of the Euclidean distance.
Because it's not a great approach, the QIIME 2 devs decided not to implement it. Sometimes, there are guard rails in place to discourage people from making less effective analysis decisions, and this is one of them.
The second issue with classic, Euclidean-based PCA is compositionality. Essentially, our microbiome data is constrained: each sample's relative abundances add up to 1. If you haven't read *Microbiome datasets are compositional and this is not optional*, I'd highly recommend it. Before you can run PCA, you'd need to do a compositional transform.
Luckily... Aitchison distance is a compositional Euclidean distance (it's Euclidean distance computed on CLR-transformed data)! So, theoretically, you could build a compositional PCA from CLR-transformed data. It was possible; leaving it out was a choice. (As a note, PhILR in R also offers a PCA based on an ILR transform; since PhILR isn't a plugin, this is unavailable in QIIME 2, but it's totally an option in R.)
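For the curious, here's a small sketch of what "Aitchison distance is a Euclidean distance" means in practice. The `clr` helper and the two samples are made up for illustration, and the samples deliberately have no zeros.

```python
import numpy as np

def clr(x):
    """Centered log-ratio: log of each part minus the sample's mean log."""
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)

# Two made-up samples with strictly positive counts (no zeros yet).
a = np.array([10.0, 20.0, 70.0])
b = np.array([30.0, 30.0, 40.0])

# Aitchison distance is plain Euclidean distance between CLR vectors.
aitchison = np.linalg.norm(clr(a) - clr(b))

# CLR ignores the total: scaling a sample by a constant changes nothing.
scale_invariant = np.allclose(clr(a), clr(a * 1000))
print(aitchison, scale_invariant)
```

A PCA on the CLR-transformed table would then be the compositional PCA described above, as long as there are no zeros to take the log of.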
Unfortunately... Aitchison has a major problem that would show up in a PCA transform. Most microbiome environments are sparse. (There's a whole literature on this sparsity.) I primarily work in guts, and back of the envelope, I expect about 80% of my features to be present in less than 20% of my samples. That means a lot of zeros. We can, of course, do zero substitution: in QIIME 2, that's done by adding a pseudocount, while other approaches substitute in a very low value. A pseudocount ends up weighting your data more heavily toward sparse features, and so your ordination becomes a reflection of that sparsity.
You could also filter the data, but how to do that is, I think, a whole other issue I can't answer. I think you'd have to do a series of tests, pick the correct filtering approach, and...
A second piece of this transform is that, because of the CLR and zero substitution, Aitchison distance (and Aitchison PCA) is depth sensitive.
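A toy example of that depth sensitivity, with made-up counts and a pseudocount of 1 chosen purely for illustration: the same community sequenced 100x deeper ends up far away from itself once a zero has been replaced.

```python
import numpy as np

def clr(x):
    """Centered log-ratio: log of each part minus the sample's mean log."""
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)

# Same underlying community at two sequencing depths (made-up counts).
shallow = np.array([0.0, 5.0, 45.0, 50.0])
deep = shallow * 100                   # identical proportions, 100x the reads

pseudocount = 1.0                      # assumed value, just for the example
d_pseudo = np.linalg.norm(clr(shallow + pseudocount) - clr(deep + pseudocount))

# Drop the zero feature and no pseudocount is needed: depth cancels out.
d_nozero = np.linalg.norm(clr(shallow[1:]) - clr(deep[1:]))
print(d_pseudo, d_nozero)
```

The distance with the pseudocount is driven mostly by the feature that was zero in both samples, which is the "ordination becomes a reflection of sparsity" problem in miniature.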
DEICODE solves both of these issues and has the advantage of being a PCA implemented for QIIME 2. Basically, DEICODE takes your data, performs a partial CLR transform, and then uses mathematical magic (sparse matrix completion) to solve an ordination while dealing with the zeros. The underlying math is beyond me, but when I tried to get it through my linear algebra-less brain, I found the GitHub tutorials super helpful.
The output is a nice biplot with feature and sample loadings, and both Emperor and Qurro are set up to accept these and let you make pretty visualizations.
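To make the "partial CLR" idea a bit more concrete, here's my rough understanding as a NumPy sketch. This is not DEICODE's code: `rclr_sketch` is a hypothetical helper, the table is made up, and the step that fills in the missing values and solves the ordination is left out entirely.

```python
import numpy as np

def rclr_sketch(counts):
    """Log-ratio transform using only the observed (nonzero) entries.

    Zeros are left as NaN ("missing") instead of being replaced by a pseudocount.
    """
    counts = np.asarray(counts, dtype=float)
    logged = np.full(counts.shape, np.nan)
    observed = counts > 0
    logged[observed] = np.log(counts[observed])
    # Center each sample by the mean log of its own nonzero features.
    return logged - np.nanmean(logged, axis=1, keepdims=True)

# Made-up table: rows are samples, columns are features.
table = np.array([[0, 5, 45, 50],
                  [0, 500, 4500, 5000],    # same community, 100x deeper
                  [12, 0, 0, 88]])
transformed = rclr_sketch(table)
print(transformed)
```

The first two rows come out identical despite the 100x difference in depth, and the zeros stay flagged as missing rather than being treated as real tiny values.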
What about RDA?
In theory, RDA could be added to QIIME 2 as a plugin. AFAIK, it doesn't exist now, but it could in theory. I don't know how that would play with Qurro, Emperor, etc. That would depend on the developers/maintainers of those packages, and I'm primarily a user who occasionally requests weird features.
However, there is a way to get that pseudo feature ranking. Adonis will provide an effect size ranking for covariates of interest with both continuous and categorical data. You can even adjust for other factors (as long as you set up the equation correctly). The current QIIME 2 adonis implementation does this process one at a time, so you'd have to harvest the data yourself to turn it into a visualization. You can see one example of that adonis work in panel B of figure 1 from He et al., showing their geographic effect is 5x any other factor.
It's not quite that nice factor-loading table, but it's a semi-convenient way to look at the data.
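If you want a feel for what "harvesting" one effect size per covariate looks like, here's a stripped-down sketch. It is not adonis: `permanova_r2` is a hypothetical helper that only computes the R² for a single categorical factor from a distance matrix, with no permutation test, no adjustment for other factors, and no continuous covariates. The samples and metadata are made up.

```python
import numpy as np

def permanova_r2(dist, groups):
    """R^2 for one categorical factor from a square distance matrix."""
    dist = np.asarray(dist, dtype=float)
    groups = np.asarray(groups)
    n = len(groups)
    # Total sum of squares from all pairwise distances.
    ss_total = (dist[np.triu_indices(n, k=1)] ** 2).sum() / n
    # Within-group sum of squares, one group at a time.
    ss_within = 0.0
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        sub = dist[np.ix_(idx, idx)]
        ss_within += (sub[np.triu_indices(len(idx), k=1)] ** 2).sum() / len(idx)
    return 1.0 - ss_within / ss_total

# Made-up data: 8 samples where "site" separates them and "sex" does not.
rng = np.random.default_rng(1)
points = rng.normal(size=(8, 3))
points[:4] += 5.0
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
metadata = {
    "site": ["a"] * 4 + ["b"] * 4,
    "sex": ["f", "m"] * 4,
}

# Run one factor at a time, then rank by effect size.
ranking = sorted(
    ((permanova_r2(dist, values), name) for name, values in metadata.items()),
    reverse=True,
)
for r2, name in ranking:
    print(f"{name}: R2 = {r2:.3f}")
```

In a real analysis you'd pull the R² values out of each adonis result instead, but the "loop over covariates, collect, sort" shape would be the same.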
Hopefully this helps, and let us know if you've got any questions!
Best,
Justine