Help with understanding beta diversity calculation + results

fgara · October 30, 2020, 1:38pm

Hello,

May I ask for a help in understanding how are beta diversity values calculated in QIIME2, please?

I have several questions:

These beta diversity values are calculated using PCoA via scikit-bio (e.g. http://scikit-bio.org/docs/0.5.0/diversity.html), am I right?
When I use PCoA for my data set that has hundreds of different taxa, how can I know which taxa contribute most to the PCoA axes?
I read somewhere on this forum, that if there is same distance between 2 points in axis1 and there is same distance between 2 points in axis2, the actual difference is greater between those points in axis1, because axis1 is more important that axis2? Could anyone help explaining this further please?
Is the first principal coordinate axis the line defined by the first eigenvector?
I watched a Youtube video that says that PCoA is similar to PCA, in the fact that they both have loadings/loading scores... what are these loading scores exactly?
Where can I find the PCoA loading scores in QIIME2?
What does qiime diversity pcoa-biplot do?

Many thanks for your generous help!

ChrisKeefe · October 30, 2020, 6:14pm

Lots of good questions, @fgara. I'll tackle the ones I'm confident in, and try to get you resources for the others.

You've got the order of operations a little mixed up, but you're headed in the right direction. When making a PCoA plot, distance matrices (distances between samples) are calculated first, and then PCoA results are produced from the distance matrices.

The distance matrices are built using skbio or unifrac. Most of the skbio calculations are actually passed off to sklearn or scipy (details here), but some are implemented in q2-diversity-lib.

Biplots are a great way of doing exactly this!
I made this example with diversity pcoa_biplot, and then visualized it with emperor biplot.

Each arrow describes one feature's contribution to the difference described by the PCoA plot. The red arrow (feature b3c...965a) is the biggest; it contributes the most. The purple arrow (82b...cda) is the smallest of the five most-prominent features; it contributes the fifth-most. You can adjust how many arrows are plotted with emperor biplot.

If you want to know what features these are, use a classifier to create a FeatureData[Taxonomy] artifact, and visualize it with qiime metadata tabulate. You'll get a nice list where you can search the feature ids you see at the ends of the arrows.

IIRC, the first principal coordinate axis is the vector with the largest eigenvalue - the most "important" vector, if you will. The percentage of variance each axis explains is displayed in parentheses at the end of that axis. Again, IIRC, this is the axis eigenvalue, divided by the sum of all axis eigenvalues. The "axes" tab of the visualization also has a nice little plot explaining percent variance explained by the top 5 axes.

I think the idea here is that how important one unit of difference is should be scaled by how important the axis is. If PC52 only explains .00001% of the variance in your samples, it doesn't matter much how far apart two points are on that axis. If, on the other hand, PC1 explains 45% of the total variance, differences between samples on PC1 contribute much more relative to the overall variance.

Unfortunately, @fgara, this is where you lose me. This ResearchGate post has a couple resources that might help you with how to interpret loadings in PCoA, but it sounds like they may not carry the same meaning as PCA loadings. I'm not sure they're available at all in QIIME 2 artifacts, but if they were, you'd probably have to extract them from a PCoAResults or PCoAResults % Properties('biplot') Artifact. Maybe someone with more experience with the nuts and bolts of PCoA will weigh in on this.

Best,
Chris

fgara · October 30, 2020, 7:08pm

Hi Chris,

Wow, thank you so much for your very detailed answers!
I will read through the links you gave me and try the biplot.
I hope you don't mind me asking more questions if I have them later

Thank you so much once again for your generous time and kind help!

sbslee · November 2, 2020, 12:44pm

Hello @fgara,

In addition to @ChrisKeefe's helpful answer above, I just wanted to point out that PCA is equivalent to PCoA when the distance metric used is Euclidean (i.e. PCA is a type of PCoA).

As for PCA loadings, according to this post, they are defined as:

PCA loadings are the coefficients of the linear combination of the original variables from which the principal components (PCs) are constructed.

In order words, you can treat them as relative weights of the variables for each PC.

The above post also explains how to access those loadings at least for sklearn.decomposition.PCA (see the components_ attribute in the docs as well).

Hope this helps!

fgara · November 2, 2020, 10:54pm

Hi @sbslee,

Thank you so much for your kind reply!
I followed your link for PCA on sklearn, and yes, I could see the components_ attribute.
And thank you for sharing the Github post by scentellegher! It's a very helpful post for understanding PCA.

It seems that QIIME2 does not use sklearn for computing the PCoA? Because according to this doc page, there is no PCoA module in sklearn (please correct me if I'm wrong).

I found this doc page for PCoA on skbio, so maybe that's what QIIME2 uses, but the page does not have info on components nor loadings.

thermokarst · November 3, 2020, 12:16am

A post was split to a new topic: May I ask a few more questions please?

sbslee · November 3, 2020, 12:30am

Hi @fgara,

This is me just speculating, so please take my answers with a huge grain of salt (anyone is welcome to correct me if I'm wrong!).

As far as I know, QIIME 2 uses the skbio.stats.ordination.pcoa method to perform PCoA (see the method's docs and also QIIME 2 implementation in the q2-diversity plugin). What I'm not so sure is whether the skbio.stats.ordination.pcoa method was written from scratch or is based on other package(s) like scikit-learn (I don't have time to look it up now; you're welcome to see yourself!).

In order to perform PCoA with sklearn, you need to use the sklearn.manifold.MDS class (see the docs). MDS stands for multidimensional scaling, and when you use metric=True then it's equivalent to PCoA. For more information, please see MDS page from Wikipedia (in particular, see the Classical multidimensional scaling section).

Nice, so you did find out about the skbio.stats.ordination.pcoa method! As for why there are (apparently) no loadings for PCoA (or MDS in sklearn for that matter), as opposed to PCA, I'm guessing that's because PCoA does not combine the variables linearly for each axis except for when Euclidean distance is used (i.e. there are no simple coefficients to report)? Below I'm just copying and pasting my previous reply on the case of PCA as comparison.

This is as far as my small knowledge of ordination gets me. This is an interesting topic for sure and that's why I wanted to jump in. It would be nice to have other people with more knowledge jumping into this discussion.

system · December 4, 2020, 6:30am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.