How to determine the components of PC1?

April_Oliver · February 25, 2022, 10:07pm

I am using QIIME 2 version 2021.11 installed on WSL. After running diversity core-metrics-phylogenetic, I looked at the weighted UniFrac Emperor plot, which showed that PC1 accounted for almost 75% of the variation. I'm really interested in which features contribute to PC1, but I can't seem to find any tools that let me view this information. I found one forum post that recommended using the qiime2R package, but I was hoping I could do this easily in Linux.

jwdebelius · February 25, 2022, 11:38pm

Hi @April_Oliver,

This is a... complicated problem. So, I'll start with some theory that you can skip over if you want to, and then offer a couple of practical solutions which may or may not be useful.

Theory

The visualization you're looking at is a PCoA (Principal Coordinates Analysis). It gets generated in 3(ish) steps under the hood:

1. Calculate a distance matrix using the data table (and tree).
2. Perform PCoA projection on the distance matrix to give coordinates.
3. Visualize the PCoA.
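
To make the three steps above concrete, here is a minimal sketch in Python with NumPy. It is not what QIIME 2 runs internally: the toy table is hypothetical, and Bray-Curtis is swapped in for weighted UniFrac so that no tree is needed. The point is only that step 2 receives a samples-by-samples matrix, so the feature axis is already gone by then.

```python
import numpy as np

def bray_curtis(table):
    """Pairwise Bray-Curtis distances between rows (samples) of a count table."""
    n = table.shape[0]
    dm = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            num = np.abs(table[i] - table[j]).sum()
            den = (table[i] + table[j]).sum()
            dm[i, j] = num / den
    return dm

def pcoa(dm):
    """Classical PCoA: double-center the squared distances, then eigendecompose."""
    n = dm.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gower = -0.5 * centering @ (dm ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gower)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    positive = eigvals > 1e-10                 # drop zero/negative axes
    coords = eigvecs[:, positive] * np.sqrt(eigvals[positive])
    # Proportion explained, computed over the retained (positive) axes only.
    explained = eigvals[positive] / eigvals[positive].sum()
    return coords, explained

# Hypothetical toy table: 4 samples (rows) x 3 features (columns).
table = np.array([
    [10, 0, 5],
    [8, 1, 6],
    [0, 12, 3],
    [1, 10, 4],
], dtype=float)

dm = bray_curtis(table)        # step 1: samples x samples, no feature axis
coords, explained = pcoa(dm)   # step 2: sample coordinates per axis
print(coords[:, 0])            # step 3 would plot these coordinates
print(explained)
```

Note that `pcoa` never sees `table`, only `dm`, which is why the ordination on its own cannot report which features drive PC1.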

However, the metric we use is a summary statistic, in that we can map features to the value but we can't untangle them. So, if I have a weighted UniFrac distance of 0.5, I can say that 50% of the branch weights are shared between the two samples, but I can't tell you which branches were shared.
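
A small illustration of the "summary statistic" point, as a sketch only: it uses Bray-Curtis rather than weighted UniFrac (to avoid needing a tree), and the sample vectors are made up. Two samples that differ from a reference in completely different features end up at exactly the same distance from it.

```python
def bray_curtis_pair(u, v):
    """Bray-Curtis distance between two count vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den

# Hypothetical counts for four features in three samples.
reference = [10, 10, 10, 10]
differs_in_first = [0, 10, 10, 10]   # only the first feature is missing
differs_in_last = [10, 10, 10, 0]    # only the last feature is missing

d_first = bray_curtis_pair(reference, differs_in_first)
d_last = bray_curtis_pair(reference, differs_in_last)

# Same distance, different features responsible for it.
print(d_first, d_last, d_first == d_last)
```

Given only the two distances, there is no way to tell which feature was responsible in each case.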

Then, we take those distances and essentially try to make them into a map by stretching, rotating, etc... Here are a couple of focused articles that can give you more details.

Then... we build the visualization.

The problem is that when you see clustering in the visualization, it's based on the PCoA calculation, which is based on the distance calculation, which doesn't remember the features.

But, we can hack the visualization to get back to the features.

Hacks

The traditional hack to get feature placement is to use a biplot, where the taxonomy table gets projected into the PCoA space. This places arrows in your final PCoA plot that show you where each feature should appear.
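
As a rough sketch of what "projecting features into the PCoA space" can mean, the snippet below computes the correlation between each feature's relative abundance and each ordination axis. This is an illustration under assumptions, not necessarily the exact calculation QIIME 2 performs: the coordinates, abundances, and feature IDs are all hypothetical.

```python
import numpy as np

# Hypothetical inputs: PC1/PC2 coordinates for 4 samples, and the
# relative abundances of 3 features in those same samples.
sample_coords = np.array([
    [-0.35, 0.05],
    [-0.30, -0.05],
    [0.34, 0.06],
    [0.31, -0.06],
])
rel_abund = np.array([
    [0.67, 0.00, 0.33],
    [0.53, 0.07, 0.40],
    [0.00, 0.80, 0.20],
    [0.07, 0.67, 0.26],
])
feature_ids = ["feature_a", "feature_b", "feature_c"]

def project_features(coords, features):
    """Correlation of each feature (column) with each ordination axis."""
    n = coords.shape[0]
    z_feat = (features - features.mean(axis=0)) / features.std(axis=0, ddof=1)
    z_axes = (coords - coords.mean(axis=0)) / coords.std(axis=0, ddof=1)
    return z_feat.T @ z_axes / (n - 1)   # shape: features x axes

arrows = project_features(sample_coords, rel_abund)
for fid, (pc1, pc2) in zip(feature_ids, arrows):
    print(f"{fid}: PC1={pc1:+.2f} PC2={pc2:+.2f}")
```

Features with a large absolute value in the PC1 column are the ones whose abundance tracks PC1 most closely; the sign says which end of the axis they point toward.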

A second (newer!) solution is to use an empire plot, which is available through the empress plugin. This interactive visualization will let you highlight the features that are specific to a set of samples, or a set of samples that contain specific features.

Alternatives

If you don't like the hacks, there is a metric that lets you map the ordination space to both samples and features. DEICODE is a robust PCA (rPCA) technique designed specifically for microbiome data (it knows how to handle both the compositionality and the sparsity). It places both the features and the samples together into the same ordination space. You can visualize the ordination as a biplot in Emperor. You can also pass the coordinates into Qurro to see specifically which features are separating samples along an axis and/or calculate an ALR (additive log-ratio) that you can pass to your favorite programs.
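
To show why a PCA-style method gives you features and samples in the same space, here is a simplified stand-in, not DEICODE itself. It assumes a pseudocount-based CLR transform followed by a plain SVD, whereas DEICODE uses a robust CLR and matrix completion to deal with zeros. The table and feature IDs are hypothetical.

```python
import numpy as np

# Hypothetical toy table: 4 samples (rows) x 3 features (columns).
table = np.array([
    [10, 0, 5],
    [8, 1, 6],
    [0, 12, 3],
    [1, 10, 4],
], dtype=float)
feature_ids = ["feature_a", "feature_b", "feature_c"]

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform with a pseudocount (a simplification)."""
    logged = np.log(counts + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)

transformed = clr(table)
centered = transformed - transformed.mean(axis=0)

# SVD gives sample scores and feature loadings for the same axes.
u, s, vt = np.linalg.svd(centered, full_matrices=False)
sample_scores = u * s        # samples x components
feature_loadings = vt.T      # features x components
explained = s ** 2 / (s ** 2).sum()

# Rank features by the size of their loading on the first axis.
ranked = sorted(
    zip(feature_ids, feature_loadings[:, 0]),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for fid, loading in ranked:
    print(f"{fid}: PC1 loading = {loading:+.3f}")
```

Because the loadings come out of the same decomposition as the sample scores, features with large loadings of opposite sign on an axis are natural candidates for the numerator and denominator of a log-ratio to explore further.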

My suspicion is that in your case, if the difference is so pronounced in PCoA space, it will be pronounced in rPCA space.

Best,
Justine
