PCoA results differ between r and QIIME even when using QIIME-produced matrix

naltmank · May 15, 2023, 3:21pm

Hi there,

Not to add to the glut of questions regarding PCoA outputs on QIIME vs R, but I couldn't find one that addresses the issue I'm having. For some background, I have multiple distance matrices that I've produced using phyloseq, rbiom, etc. based on data that I've imported into R from QIIME, and I'd like to analyze/visalize my data with PCoA. When I use the binary jaccard, the results I produce in R are identical to those produced in QIIME. So far, so good. However, when I tried running PCoA on the bray-curtis or unifrac distance matrices, my results started to differ. After some digging, I saw that there's some confusion on how exactly QIIME treats the data prior to computing a bray-curtis distance matrix, and I saw that there are some issues with phyloseq computing unifrac distance matrices. After troubleshooting, I still couldn't replicate the results from QIIME2 in R, so I decided to try importing and analyzing the distance matrices that QIIME2 produced and was (theoretically) using for PCoA. This is where things got weird for me. Even when I use the exact same data that QIIME2 is using (i.e. the distance matrix I imported directly from qiime2 into R), my results still differ. Below is an example showing the results from PCoA using a bray-curtis matrix:

And again showing results from weighted unifrac:

The distances are slight so in the grand scheme of things they probably don't matter, but I'm still not sure how this is possible.

Also, for what it's worth, while I couldn't recreate the bray-curtis matrix that qiime produced in R, the weighted unifrac matrix I produced with rbiom was identical to the one that I ultimately ended up from qiime2 for the comparison above.

If anyone has any thoughts or ideas on what's going on here please let me know!

Cheers,
Noam

colinbrislawn · May 16, 2023, 3:54pm

I think this has to do with the practical implementation of the math 'under the hood' of these algorithms.

For some methods/equations, the math is deterministic and exact: with the same inputs, there is a single right answer and you get this every time.
For other methods, like statistical resampling, the math is an estimate. This can vary slightly depending on inputs or the random seed used at the start.

PCoA is one of these 'best fit' methods that you expect to change slightly. Note how Axis1 is around 40%, but not exactly. Different solvers for eigenvectors can give you different fits. It looks like the solver in R is able to explain more in Axis 1 and 2 than the solver in Python

The big thing to worry about is math that's just WRONG. That bug in Phyloseq's UniFrac is a big problem.

Assumptions of the pipelines can also be wrong. Some phyloseq functions drop data that does not match or is all zero, and this change in put data can also change results.

naltmank · May 16, 2023, 4:23pm

Gotcha, that makes sense to me! I had assumed it was likely something along those lines but don't love saying "black box makes numbers different" without a bit more context.

And agreed about the bug in Phyloseq. Luckily other people found the workaround before me, and the rbiom fix seems to work great.