Transforms for correcting negative eigenvalues for PCoA on semi-metric distances

ChrisKeefe · December 23, 2020, 2:03am

Hi all,
The brilliant @hsapers raised some great questions about how QIIME 2 plots semi-metric distances in this User Support topic. As a result, we opened this issue on transforming semi-metric data for PCoA plotting. There are a lot of ways we could implement this, and I'd like to use this topic to hash out some of the details.

Initial questions are in headers/boldface below. Opinions welcome!

Background

When non-Euclidean distance matrices (e.g. those produced with semi-metric measures like Bray-Curtis) are used in PCoA calculations, the eigendecomposition may yield negative eigenvalues. The axes that correspond to them have imaginary coordinates, which can't be represented meaningfully in a PCoA plot. When the magnitude of the negative eigenvalues is small relative to the positive ones, the axes represented by the PCoA plot are largely unaffected. If the magnitude is large, the plot might not be meaningful/interpretable.
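To make this concrete, here is a minimal NumPy sketch of the Gower centering and eigendecomposition that classical PCoA performs. The function name and the 4-sample toy matrix are made up for illustration, and this is not the skbio implementation. The toy distances satisfy the triangle inequality but cannot be embedded in Euclidean space, so one eigenvalue comes out negative.

```python
import numpy as np


def pcoa_eigenvalues(distances):
    """Return eigenvalues of the Gower-centered matrix, largest first."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Centering matrix: subtracts row and column means, adds the grand mean.
    centering = np.eye(n) - np.ones((n, n)) / n
    gower = centering @ (-0.5 * d ** 2) @ centering
    # eigvalsh returns ascending order for symmetric matrices; reverse it.
    return np.linalg.eigvalsh(gower)[::-1]


# Hypothetical toy data: three samples 2.0 apart, a fourth 1.1 from each.
# No point in Euclidean space is that close to all three corners at once.
toy = [
    [0.0, 2.0, 2.0, 1.1],
    [2.0, 0.0, 2.0, 1.1],
    [2.0, 2.0, 0.0, 1.1],
    [1.1, 1.1, 1.1, 0.0],
]

eigenvalues = pcoa_eigenvalues(toy)
print(eigenvalues)
```

For this toy matrix the smallest eigenvalue is about -0.0925, against two leading eigenvalues of 2.0, so the first two axes are barely affected.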

There is a great explanation of the math (and sources) in this open issue about the related skbio warning.

Negative eigenvalues appear to be problematic for PCoA only when their magnitude is large, and though QIIME 2 passes along the skbio warning that users should check their data, it provides no tools for correcting negative eigenvalues. So here we are!
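One rough way to put a number on "large" is to compare the summed magnitude of the negative eigenvalues to the sum of the positive ones. This is only a sketch: the function name is hypothetical, the eigenvalues are made up, and I'm not proposing any particular cutoff.

```python
def negative_eigenvalue_share(eigenvalues):
    """Ratio of summed |negative| eigenvalues to summed positive eigenvalues."""
    positive = sum(v for v in eigenvalues if v > 0)
    negative = sum(-v for v in eigenvalues if v < 0)
    if positive == 0:
        raise ValueError("no positive eigenvalues to compare against")
    return negative / positive


# Made-up eigenvalues for illustration only.
mild = [2.0, 2.0, 0.0, -0.0925]
severe = [2.0, 1.0, -0.9, -1.2]

print(negative_eigenvalue_share(mild))    # small share: plot likely fine
print(negative_eigenvalue_share(severe))  # large share: plot is suspect
```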

What transformations should we implement?

Pg. 25 of this Pierre Legendre deck proposes three corrections that can be applied to "fix" the negative eigenvalues:

take the square roots of the dissimilarities before PCoA
Lingoes method: add a constant to the squared dissimilarities
Cailliez method: add a constant to the dissimilarities

The sqrt(D) method works with most data, but does not guarantee success. (This is mentioned in an r-vegan issue (edit: issue resolved), and IIRC the Legendre deck refers to L&L for a list of applicable measures.) Lingoes and Cailliez are slightly more complex, but do guarantee "euclidified" distances when supplied an appropriate constant.
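Here is a small NumPy sketch of the "works with most data, but no guarantee" point about sqrt(D). Both matrices are invented for illustration and the helper name is hypothetical. The first matrix is mildly non-Euclidean and is repaired by the square root; the second violates the triangle inequality so badly that it still has a negative eigenvalue afterwards.

```python
import numpy as np


def min_pcoa_eigenvalue(distances):
    """Smallest eigenvalue of the Gower-centered matrix used by PCoA."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gower = centering @ (-0.5 * d ** 2) @ centering
    return float(np.linalg.eigvalsh(gower).min())


# Made-up example: obeys the triangle inequality but is not Euclidean.
non_euclidean = np.array([
    [0.0, 2.0, 2.0, 1.1],
    [2.0, 0.0, 2.0, 1.1],
    [2.0, 2.0, 0.0, 1.1],
    [1.1, 1.1, 1.1, 0.0],
])

# Made-up example: badly violates the triangle inequality (1 + 1 < 10).
strongly_non_metric = np.array([
    [0.0, 1.0, 10.0],
    [1.0, 0.0, 1.0],
    [10.0, 1.0, 0.0],
])

for name, d in [("non_euclidean", non_euclidean),
                ("strongly_non_metric", strongly_non_metric)]:
    before = min_pcoa_eigenvalue(d)
    after = min_pcoa_eigenvalue(np.sqrt(d))  # sqrt(D) correction
    print(f"{name}: min eigenvalue {before:.4f} -> {after:.4f}")
```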

Though sqrt(D) scales the data down and the other two methods scale it up, the few results I've seen don't look dramatically different. Why would a user want one correction over another?

Common practice seems to be "implement all three", and I'm inclined to follow suit unless there's a clear reason to prefer one over the others.

Where/how should we expose this feature?

The simplest approach might be to augment diversity pcoa and pcoa-biplot with an optional --p-apply-transformation parameter, allowing the user to transform the DistanceMatrix immediately prior to PCoA, without ever exposing a transformed DistanceMatrix. r-vegan handles these transformations similarly, returning only the original "non-euclidified" data.

Plus:

simpler workflow for users
simpler changeset
no new SemanticTypes required

Minus:

transformed DistanceMatrix isn't available to other methods.

Adonis may be another candidate for these transformations. They have been implemented in vegan adonis2, but not adonis. QIIME 2 uses adonis. According to the documentation, "both functions can handle semimetric indices (such as Bray-Curtis) that produce negative eigenvalues." I suspect this just means that semimetric indices in adonis can be used, but may produce negative eigenvalues. Please correct me if I'm wrong.

Are there other methods (extant, or soon-to-be) that would benefit from our implementing a separate transformation and a DistanceMatrix[Transformed] Semantic Type in QIIME 2? Unless there's a clear yes, I'm inclined to keep this simple for now.

Possible Pitfalls?

sqrt(D) may not be applicable to some measures (or in some non-PCoA contexts?)
Lingoes and Cailliez require us to select a constant that guarantees euclidean-ness. I haven't looked into how this is done yet, and wonder whether minimality is important. If so, does this present any performance issues?
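
For discussion, here is a sketch of how I understand the minimal constants to be computed. Treat it as an assumption to check against the references rather than a finished implementation. The Lingoes constant is the absolute value of the most negative eigenvalue of the Gower-centered matrix, and the Cailliez constant is the largest eigenvalue of a 2n x 2n block matrix built from two centered matrices. Function names and the toy matrix are hypothetical.

```python
import numpy as np


def gower_center(matrix):
    """Double-center a square matrix (rows and columns sum to zero)."""
    m = np.asarray(matrix, dtype=float)
    n = m.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    return centering @ m @ centering


def lingoes_constant(distances):
    """|most negative eigenvalue| of the centered -0.5 * D^2 matrix."""
    d = np.asarray(distances, dtype=float)
    eigenvalues = np.linalg.eigvalsh(gower_center(-0.5 * d ** 2))
    return max(0.0, -float(eigenvalues.min()))


def cailliez_constant(distances):
    """Largest real eigenvalue of the 2n x 2n Cailliez block matrix."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    block = np.zeros((2 * n, 2 * n))
    block[:n, n:] = 2.0 * gower_center(-0.5 * d ** 2)
    block[n:, :n] = -np.eye(n)
    block[n:, n:] = -4.0 * gower_center(-0.5 * d)
    # The block matrix is not symmetric, so use the general solver.
    return float(np.linalg.eigvals(block).real.max())


# Made-up matrix: metric, but not embeddable in Euclidean space.
toy = np.array([
    [0.0, 2.0, 2.0, 1.1],
    [2.0, 0.0, 2.0, 1.1],
    [2.0, 2.0, 0.0, 1.1],
    [1.1, 1.1, 1.1, 0.0],
])

print(lingoes_constant(toy), cailliez_constant(toy))
```

On the performance question: Lingoes needs one symmetric n x n eigendecomposition, which PCoA already does. Cailliez needs a non-symmetric 2n x 2n one, so it should be noticeably more expensive for large matrices.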

jwdebelius · December 24, 2020, 4:47am

As a quick comment, I'm all for the switch for adonis to adonis2. Adonis2 has some shiny new features and fewer constraints than the adonis implementation.

Jari_Oksanen · September 28, 2021, 12:19pm

Sorry for dropping into this old issue (I don't follow this list, but Google brought this up), and I don't know if the issue is still valid. The Cailliez and Lingoes "corrections" are pretty simple, and they are implemented in vegan:::addLingoes and vegan:::addCailliez functions; you can look at the code there and freely copy the functions for your development if needed. These are minimal implementations intended for internal use and therefore make no check of input and have a simple UI, and only return the constant that must be applied to the dissimilarities. In Cailliez this is a simple additive constant, d + ac, added to dissimilarities d, and in Lingoes this is a bit more involved: sqrt(d^2 + 2 * ac). Both of these "corrections" are implemented in vegan distance-based functions wcmdscale (for PCoA), dbrda, adonis2, betadisper, and varpart (which also works with dissimilarities). They work quite nicely with semimetric indices, but I would be suspicious of applying them with strongly non-metric dissimilarities.
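
As a sketch of the two formulas above in Python (not a port of the vegan code): given a constant ac that has already been computed, the corrections could be applied like this. The function names are hypothetical, and leaving the diagonal at zero is my assumption, on the basis that a sample's dissimilarity to itself should stay zero.

```python
import math


def apply_cailliez(distances, ac):
    """Add the constant to every off-diagonal dissimilarity: d + ac."""
    n = len(distances)
    return [
        [0.0 if i == j else distances[i][j] + ac for j in range(n)]
        for i in range(n)
    ]


def apply_lingoes(distances, ac):
    """Off-diagonal dissimilarities become sqrt(d^2 + 2 * ac)."""
    n = len(distances)
    return [
        [0.0 if i == j else math.sqrt(distances[i][j] ** 2 + 2.0 * ac)
         for j in range(n)]
        for i in range(n)
    ]


# Made-up 2-sample matrix and constants, chosen so the arithmetic is easy.
d = [[0.0, 3.0], [3.0, 0.0]]
print(apply_cailliez(d, 1.0))  # off-diagonal: 3 + 1
print(apply_lingoes(d, 8.0))   # off-diagonal: sqrt(9 + 16)
```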

ChrisKeefe · September 30, 2021, 6:59pm

Thank you for joining us, @Jari_Oksanen! This issue is still open, and the resources you've shared will be helpful. This will probably get some developer attention in the next couple of releases.