Help with understanding beta diversity calculation + results

ChrisKeefe · October 30, 2020, 6:14pm

Lots of good questions, @fgara. I'll tackle the ones I'm confident in, and try to get you resources for the others.

You've got the order of operations a little mixed up, but you're headed in the right direction. When making a PCoA plot, distance matrices (distances between samples) are calculated first, and then PCoA results are produced from the distance matrices.
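To make that order of operations concrete, here is a minimal sketch in plain numpy. It is not the code QIIME 2 runs (that goes through skbio, as noted below). The toy feature table, the `bray_curtis` helper, and the `classical_pcoa` helper are all made up for illustration. The point is only that the distance matrix comes first and the ordination is computed from it.

```python
import numpy as np

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors."""
    return np.abs(u - v).sum() / (u + v).sum()

def classical_pcoa(dm):
    """Classical PCoA: eigendecomposition of the Gower-centered matrix."""
    n = dm.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dm ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1e-10                     # drop zero/negative axes
    coords = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    return eigvals[keep], coords

# Toy feature table (assumption): 4 samples (rows) x 4 features (columns).
counts = np.array([
    [10, 0, 5, 1],
    [8, 1, 6, 0],
    [0, 12, 1, 9],
    [1, 10, 0, 11],
], dtype=float)

# Step 1: pairwise distances between samples.
n = counts.shape[0]
dm = np.array([[bray_curtis(counts[i], counts[j]) for j in range(n)]
               for i in range(n)])

# Step 2: PCoA is computed from the distance matrix, not from the counts.
eigvals, coords = classical_pcoa(dm)
print(coords[:, :2])  # sample positions on the first two axes
```

Note that `classical_pcoa` never sees the feature table, only `dm`. That is why the choice of distance metric drives what the PCoA plot looks like.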

The distance matrices are built using skbio or unifrac. Most of the skbio calculations are actually passed off to sklearn or scipy (details here), but some are implemented in q2-diversity-lib.

Biplots are a great way of doing exactly this!
I made this example with diversity pcoa_biplot, and then visualized it with emperor biplot.

Each arrow describes one feature's contribution to the difference described by the PCoA plot. The red arrow (feature b3c...965a) is the biggest; it contributes the most. The purple arrow (82b...cda) is the smallest of the five most-prominent features; it contributes the fifth-most. You can adjust how many arrows are plotted with the --p-number-of-features parameter of emperor biplot.
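As a rough illustration of "most prominent", the sketch below ranks features by arrow length, taken here to be the Euclidean length of each feature's coordinates in the ordination. That reading is an assumption rather than something confirmed above. The feature names and coordinates are toy values, and `arrow_length` and `top_features` are hypothetical helpers.

```python
import math

# Hypothetical feature coordinates on the first three axes (toy values).
feature_coords = {
    "feature_a": (0.40, -0.10, 0.05),
    "feature_b": (0.05, 0.02, -0.01),
    "feature_c": (-0.20, 0.25, 0.00),
    "feature_d": (0.10, 0.10, 0.10),
}

def arrow_length(coords):
    """Euclidean length of an arrow drawn from the origin to coords."""
    return math.sqrt(sum(c * c for c in coords))

def top_features(feature_coords, n):
    """Return the n feature ids with the longest arrows, longest first."""
    ranked = sorted(feature_coords,
                    key=lambda f: arrow_length(feature_coords[f]),
                    reverse=True)
    return ranked[:n]

top_two = top_features(feature_coords, 2)
print(top_two)
```

Raising or lowering `n` here plays the same role as changing how many arrows the biplot draws.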

If you want to know what features these are, use a classifier to create a FeatureData[Taxonomy] artifact, and visualize it with qiime metadata tabulate. You'll get a nice list where you can search the feature ids you see at the ends of the arrows.

IIRC, the first principal coordinate axis is the vector with the largest eigenvalue - the most "important" vector, if you will. The percentage of variance each axis explains is displayed in parentheses at the end of that axis. Again, IIRC, this is the axis eigenvalue divided by the sum of all axis eigenvalues. The "axes" tab of the visualization also has a nice little plot showing the percent variance explained by the top 5 axes.
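The arithmetic for those axis percentages is short enough to sketch. The eigenvalues below are toy numbers chosen for illustration, not values from a real analysis.

```python
# Toy eigenvalues (assumption), sorted from largest to smallest.
eigenvalues = [4.5, 2.5, 1.5, 1.0, 0.5]

# Proportion explained = each eigenvalue / sum of all eigenvalues.
total = sum(eigenvalues)
proportion_explained = [ev / total for ev in eigenvalues]

# Axis labels in the style shown on the plot, e.g. "PC1 (45.00%)".
axis_labels = [f"PC{i} ({p:.2%})"
               for i, p in enumerate(proportion_explained, start=1)]
print(axis_labels)
```

The proportions always sum to 1, so a large share for the first axis necessarily leaves less for the rest.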

I think the idea here is that the importance of one unit of difference should be scaled by the importance of the axis it falls on. If PC52 only explains 0.00001% of the variance in your samples, it doesn't matter much how far apart two points are on that axis. If, on the other hand, PC1 explains 45% of the total variance, differences between samples on PC1 contribute much more relative to the overall variance.

Unfortunately, @fgara, this is where you lose me. This ResearchGate post has a couple of resources that might help you with how to interpret loadings in PCoA, but it sounds like they may not carry the same meaning as PCA loadings. I'm not sure they're available at all in QIIME 2 artifacts, but if they were, you'd probably have to extract them from a PCoAResults or PCoAResults % Properties('biplot') Artifact. Maybe someone with more experience with the nuts and bolts of PCoA will weigh in on this.

Best,
Chris