Hi all,
The brilliant @hsapers raised some great questions about how QIIME 2 plots semimetric distances in this User Support topic. As a result, we opened this issue on transforming semimetric data for PCoA plotting. There are a lot of ways we could implement this, and I’d like to use this topic to hash out some of the details.
Initial questions are in headers/boldface below. Opinions welcome!
Background
When non-Euclidean distance matrices (e.g. those produced with semimetric measures like Bray-Curtis) are used in PCoA calculations, the resulting negative eigenvalues may produce complex values which can’t be represented meaningfully in a PCoA plot. When the magnitude of these values is small, the axes represented by the PCoA plot are largely unaffected. If the magnitude is large, the plot might not be meaningful/interpretable.
There is a great explanation of the math (and sources) in this open issue about the related skbio warning.
Negative eigenvalues appear to be problematic for PCoA only when their magnitude is large, and though QIIME 2 passes along the skbio warning that users should check their data, it provides no tools for correcting negative eigenvalues. So here we are!
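To make the problem concrete, here is a minimal NumPy sketch (not the skbio implementation) of where the negative eigenvalues come from. The helper names `gower_center` and `pcoa_eigenvalues` and the toy matrix are my own, chosen only for illustration; the toy dissimilarities deliberately violate the triangle inequality.

```python
import numpy as np

def gower_center(d):
    """Double-center -0.5 * d**2, the matrix classical PCoA decomposes."""
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n
    return j @ (-0.5 * d ** 2) @ j

def pcoa_eigenvalues(d):
    """Eigenvalues of the centered matrix, largest first."""
    return np.linalg.eigvalsh(gower_center(d))[::-1]

# Hypothetical toy dissimilarities: d(a, c) = 3 > d(a, b) + d(b, c) = 2,
# so no Euclidean configuration of three points can reproduce them.
d = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0],
              [3.0, 1.0, 0.0]])

eig = pcoa_eigenvalues(d)
negative = eig[eig < 0]

# One rough way to judge "large": the share of total absolute eigenvalue
# mass that sits on negative eigenvalues.
negative_share = np.abs(negative).sum() / np.abs(eig).sum()
print(eig, negative_share)
```

For this matrix one eigenvalue is clearly negative (about -0.83). The `negative_share` ratio is just one possible way to quantify "large"; I'm not claiming it is the threshold skbio uses for its warning.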
What transformations should we implement?
Pg. 25 of this Pierre Legendre deck proposes three corrections that can be applied to “fix” the negative eigenvalues:
 take the square roots of the dissimilarities before PCoA
 Lingoes method: add a constant to the squared dissimilarities
 Cailliez method: add a constant to the dissimilarities
The sqrt(D) method works with most data, but does not guarantee success. (This is mentioned in an r-vegan issue, and IIRC the Legendre deck refers to L&L for a list of applicable measures.) Lingoes and Cailliez are slightly more complex, but do guarantee “euclidified distances” when supplied an appropriate constant.
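For discussion purposes, the three corrections can be sketched as plain NumPy functions. The function names are hypothetical, the constants `c1` and `c2` are taken as given here, and I'm assuming the usual formulation in which the constant is applied only to off-diagonal entries (with Lingoes written as sqrt(D² + 2·c1)).

```python
import numpy as np

def sqrt_correction(d):
    """Take the square root of every dissimilarity."""
    return np.sqrt(d)

def lingoes_correction(d, c1):
    """Add 2 * c1 to the squared off-diagonal dissimilarities."""
    out = np.sqrt(d ** 2 + 2.0 * c1)
    np.fill_diagonal(out, 0.0)  # self-distances stay zero
    return out

def cailliez_correction(d, c2):
    """Add c2 to the off-diagonal dissimilarities."""
    out = d + c2
    np.fill_diagonal(out, 0.0)  # self-distances stay zero
    return out

# Hypothetical toy matrix and arbitrary constants, for illustration only.
d = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0],
              [3.0, 1.0, 0.0]])
d_sqrt = sqrt_correction(d)
d_lingoes = lingoes_correction(d, c1=0.5)
d_cailliez = cailliez_correction(d, c2=1.0)
```

Each function returns a new array and leaves the input untouched, which fits the idea of transforming immediately before PCoA without altering the stored DistanceMatrix.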
Though sqrt(D) scales the data down and the other two methods scale it up, the few results I’ve seen don’t look dramatically different. Why would a user want one correction over another?
Common practice seems to be “implement all three”, and I’m inclined to follow suit unless there’s a clear reason to prefer one over the others.
Where/how should we expose this feature?
The simplest approach might be to augment diversity pcoa and pcoa-biplot with an optional --p-apply-transformation parameter, allowing the user to transform the DistanceMatrix immediately prior to PCoA without ever exposing a transformed DistanceMatrix. r-vegan handles these transformations similarly, returning only the original “non-euclidified” data.
Plus:
 simpler workflow for users
 simpler changeset
 no new SemanticTypes required
Minus:
 transformed DistanceMatrix isn’t available to other methods.
Adonis may be another candidate for these transformations. They have been implemented in vegan adonis2, but not adonis. QIIME 2 uses adonis. According to the documentation, “both functions can handle semimetric indices (such as Bray-Curtis) that produce negative eigenvalues.” I suspect this just means that semimetric indices can be used in adonis, but may produce negative eigenvalues. Please correct me if I’m wrong.
Are there other methods (extant, or soon-to-be) that would benefit from our implementing a separate transformation and a DistanceMatrix[Transformed] Semantic Type in QIIME 2? Unless there’s a clear yes, I’m inclined to keep this simple for now.
Possible Pitfalls?

 sqrt(D) may not be applicable to some measures (or in some non-PCoA contexts?)
 Lingoes and Cailliez require us to select a constant that guarantees euclideanness. I haven’t looked into how this is done yet, and wonder whether minimality is important. If so, does this present any performance issues?
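As a starting point on the constant question, here is a sketch of how I understand the minimal constants are usually derived in the literature (Lingoes from the most negative PCoA eigenvalue, Cailliez from the largest eigenvalue of a 2n × 2n block matrix). Treat this as my reading rather than a vetted implementation; the function names and toy matrix are hypothetical.

```python
import numpy as np

def center(a):
    """Double-center a square matrix."""
    n = a.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n
    return j @ a @ j

def lingoes_constant(d):
    """|most negative eigenvalue| of centered -0.5 * d**2 (0 if none)."""
    min_eig = np.linalg.eigvalsh(center(-0.5 * d ** 2)).min()
    return max(0.0, -min_eig)

def cailliez_constant(d):
    """Largest real eigenvalue of a nonsymmetric 2n x 2n block matrix."""
    n = d.shape[0]
    delta1 = center(-0.5 * d ** 2)
    delta2 = center(-0.5 * d)
    block = np.block([[np.zeros((n, n)), 2.0 * delta1],
                      [-np.eye(n), -4.0 * delta2]])
    return np.linalg.eigvals(block).real.max()

# Hypothetical toy matrix that violates the triangle inequality.
d = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0],
              [3.0, 1.0, 0.0]])
off_diag = 1.0 - np.eye(d.shape[0])

c1 = lingoes_constant(d)
c2 = cailliez_constant(d)

# Apply each constant to off-diagonal entries and re-check the eigenvalues.
d_lingoes = np.sqrt(d ** 2 + 2.0 * c1 * off_diag)
d_cailliez = d + c2 * off_diag
min_eig_lingoes = np.linalg.eigvalsh(center(-0.5 * d_lingoes ** 2)).min()
min_eig_cailliez = np.linalg.eigvalsh(center(-0.5 * d_cailliez ** 2)).min()
```

On performance: if this is right, the Lingoes constant falls out of the same n × n symmetric eigendecomposition PCoA already performs, while Cailliez needs an extra nonsymmetric eigenproblem of size 2n × 2n, which could matter for large distance matrices.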