Hi all,
The brilliant @hsapers raised some great questions about how QIIME 2 plots semi-metric distances in this User Support topic. As a result, we opened this issue on transforming semi-metric data for PCoA plotting. There are a lot of ways we could implement this, and I'd like to use this topic to hash out some of the details.
Initial questions are in headers/boldface below. Opinions welcome!
Background
When non-Euclidean distance matrices (e.g. those produced with semi-metric measures like Bray-Curtis) are used in PCoA calculations, some of the resulting eigenvalues may be negative. Because each axis is scaled by the square root of its eigenvalue, those axes end up with complex values which can't be represented meaningfully in a PCoA plot. When the magnitude of the negative eigenvalues is small relative to the positive ones, the axes represented by the PCoA plot are essentially unaffected. If the magnitude is large, the plot might not be meaningful/interpretable.
There is a great explanation of the math (and sources) in this open issue about the related skbio warning.
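To make the background concrete, here is a minimal NumPy sketch of the eigenvalue step in PCoA (Gower-centering the squared distances, then an eigendecomposition). The toy matrix and the helper name gower_eigenvalues are made up for illustration and are not taken from skbio or QIIME 2. The matrix satisfies the triangle inequality but is non-Euclidean, so one eigenvalue comes out negative.

```python
import numpy as np


def gower_eigenvalues(d):
    """Eigenvalues (descending) of the Gower-centered matrix that PCoA decomposes."""
    d = np.asarray(d, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = centering @ (-0.5 * d ** 2) @ centering
    return np.sort(np.linalg.eigvalsh(b))[::-1]


# Hypothetical toy data: three samples at mutual distance 2, plus a fourth
# sample 1.1 away from each of them. No point in Euclidean space can sit that
# close to all three corners of the triangle.
toy = np.array([
    [0.0, 2.0, 2.0, 1.1],
    [2.0, 0.0, 2.0, 1.1],
    [2.0, 2.0, 0.0, 1.1],
    [1.1, 1.1, 1.1, 0.0],
])

eigvals = gower_eigenvalues(toy)
negative = eigvals[eigvals < -1e-12]

# Share of the total absolute eigenvalue mass that is negative: one rough
# way to judge whether the negative eigenvalues are "small".
negative_share = np.abs(negative).sum() / np.abs(eigvals).sum()

# Axis coordinates are eigenvectors scaled by sqrt(eigenvalue), so a negative
# eigenvalue gives an imaginary scale factor.
axis_scales = np.sqrt(eigvals.astype(complex))

print(eigvals)
print(negative_share)
```

The negative_share ratio is only one possible rule of thumb for "small vs. large"; the right threshold for a warning is an open question here.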
Negative eigenvalues appear to be problematic for PCoA only when their magnitude is large, and though QIIME 2 passes along the skbio warning that users should check their data, it provides no tools for correcting negative eigenvalues. So here we are!
What transformations should we implement?
Pg. 25 of this Pierre Legendre deck proposes three corrections that can be applied to "fix" the negative eigenvalues:
- take the square roots of the dissimilarities before PCoA
- Lingoes method: add a constant to the squared dissimilarities
- Cailliez method: add a constant to the dissimilarities
The sqrt(D) method works with most data, but does not guarantee success. This is mentioned in an r-vegan issue (edit: issue resolved), and IIRC the Legendre deck refers to L&L for a list of applicable measures. Lingoes and Cailliez are slightly more complex, but do guarantee "euclidified distances" when supplied an appropriate constant.
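For reference, the three corrections can be sketched in a few lines of NumPy. This assumes the usual textbook forms (Lingoes adds 2*c to the squared off-diagonal dissimilarities, Cailliez adds c to the off-diagonal dissimilarities). The function names, the toy matrix, and the hand-picked constants are all hypothetical and only chosen so the effect is visible.

```python
import numpy as np


def min_gower_eigenvalue(d):
    """Smallest eigenvalue of the Gower-centered matrix used by PCoA."""
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    return np.linalg.eigvalsh(centering @ (-0.5 * d ** 2) @ centering).min()


def sqrt_correction(d):
    """Take the square roots of the dissimilarities."""
    return np.sqrt(d)


def lingoes_correction(d, c):
    """Add 2*c to the squared off-diagonal dissimilarities, then take roots."""
    corrected = np.sqrt(d ** 2 + 2.0 * c)
    np.fill_diagonal(corrected, 0.0)
    return corrected


def cailliez_correction(d, c):
    """Add c to the off-diagonal dissimilarities."""
    corrected = d + c
    np.fill_diagonal(corrected, 0.0)
    return corrected


# Hypothetical non-Euclidean toy matrix (it does obey the triangle inequality).
toy = np.array([
    [0.0, 2.0, 2.0, 1.1],
    [2.0, 0.0, 2.0, 1.1],
    [2.0, 2.0, 0.0, 1.1],
    [1.1, 1.1, 1.1, 0.0],
])

# Constants picked by hand for this toy matrix only.
results = {
    "uncorrected": min_gower_eigenvalue(toy),
    "sqrt": min_gower_eigenvalue(sqrt_correction(toy)),
    "lingoes, c=0.10": min_gower_eigenvalue(lingoes_correction(toy, 0.10)),
    "lingoes, c=0.01": min_gower_eigenvalue(lingoes_correction(toy, 0.01)),
    "cailliez, c=0.15": min_gower_eigenvalue(cailliez_correction(toy, 0.15)),
    "cailliez, c=0.05": min_gower_eigenvalue(cailliez_correction(toy, 0.05)),
}
for name, smallest in results.items():
    print(f"{name}: smallest eigenvalue = {smallest:.4f}")
```

On this toy matrix the two smaller constants (0.01 and 0.05) still leave a negative eigenvalue, which is the sense in which Lingoes and Cailliez need an "appropriate" constant.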
Though sqrt(D) scales the data down and the other two methods scale it up, the few results I've seen don't look dramatically different. Why would a user want one correction over another?
Common practice seems to be "implement all three", and I'm inclined to follow suit unless there's a clear reason to prefer one over the others.
Where/how should we expose this feature?
The simplest approach might be to augment diversity pcoa and pcoa-biplot with an optional --p-apply-transformation parameter, allowing the user to transform the DistanceMatrix immediately prior to PCoA, and never exposing a transformed DistanceMatrix. r-vegan handles these transformations similarly, returning only the original "non-euclidified" data.
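A rough sketch of what that could look like, in plain NumPy rather than actual QIIME 2 or skbio code. The function pcoa_sketch, its apply_transformation argument, and the TRANSFORMATIONS registry are hypothetical stand-ins for the proposed parameter. The point is just that the transformation happens on a private copy and only the ordination is returned.

```python
import numpy as np

# Hypothetical registry; Lingoes and Cailliez would be added the same way.
TRANSFORMATIONS = {"sqrt": np.sqrt}


def pcoa_sketch(d, apply_transformation=None):
    """Return (coordinates, eigenvalues); the caller's matrix is left untouched."""
    work = np.array(d, dtype=float)  # private copy of the distance matrix
    if apply_transformation is not None:
        work = TRANSFORMATIONS[apply_transformation](work)
    n = work.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    eigvals, eigvecs = np.linalg.eigh(centering @ (-0.5 * work ** 2) @ centering)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Only axes with positive eigenvalues have real coordinates.
    keep = eigvals > 1e-12
    coords = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    return coords, eigvals


# Hypothetical non-Euclidean toy matrix.
toy = np.array([
    [0.0, 2.0, 2.0, 1.1],
    [2.0, 0.0, 2.0, 1.1],
    [2.0, 2.0, 0.0, 1.1],
    [1.1, 1.1, 1.1, 0.0],
])

raw_coords, raw_eigvals = pcoa_sketch(toy)
sqrt_coords, sqrt_eigvals = pcoa_sketch(toy, apply_transformation="sqrt")

print(raw_eigvals.min())   # negative without a transformation
print(sqrt_eigvals.min())  # about zero after sqrt, for this toy matrix
print(toy[0, 1])           # the input matrix is unchanged
```

The trade-off is the one listed in the Minus section: since only coordinates and eigenvalues come back, nothing downstream can reuse the transformed matrix.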
Plus:
- simpler workflow for users
- simpler changeset
- no new SemanticTypes required
Minus:
- transformed DistanceMatrix isn't available to other methods.
Adonis may be another candidate for these transformations. They have been implemented in vegan adonis2, but not adonis. QIIME 2 uses adonis. According to the documentation, "both functions can handle semimetric indices (such as Bray-Curtis) that produce negative eigenvalues." I suspect this just means that semimetric indices in adonis can be used, but may produce negative eigenvalues. Please correct me if I'm wrong.
Are there other methods (extant, or soon-to-be) that would benefit from our implementing a separate transformation and a DistanceMatrix[Transformed] Semantic Type in QIIME 2? Unless there's a clear yes, I'm inclined to keep this simple for now.
Possible Pitfalls?
- sqrt(D) may not be applicable to some measures (or in some non-PCoA contexts?)
- Lingoes and Cailliez require us to select a constant that guarantees euclidean-ness. I haven't looked into how this is done yet, and wonder whether minimality is important. If so, does this present any performance issues?
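As a starting point for the constant question, here is a sketch of how the constants are usually derived, as far as I understand the published methods: the Lingoes constant is the absolute value of the most negative eigenvalue of the Gower-centered matrix, and the Cailliez constant is the largest eigenvalue of a 2n x 2n block matrix built from the centered squared and unsquared distances. Function names and the toy matrix are hypothetical, and this should be checked against the original references before anyone relies on it.

```python
import numpy as np


def gower_center(a):
    """Double-center a square matrix (subtract row and column means)."""
    n = a.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    return centering @ a @ centering


def min_gower_eigenvalue(d):
    return np.linalg.eigvalsh(gower_center(-0.5 * d ** 2)).min()


def lingoes_constant(d):
    """Absolute value of the most negative eigenvalue (0 if there is none)."""
    return max(0.0, -min_gower_eigenvalue(d))


def cailliez_constant(d):
    """Largest real part among the eigenvalues of the 2n x 2n block matrix."""
    n = d.shape[0]
    delta1 = gower_center(-0.5 * d ** 2)
    delta2 = gower_center(-0.5 * d)
    block = np.block([
        [np.zeros((n, n)), 2.0 * delta1],
        [-np.eye(n), -4.0 * delta2],
    ])
    return max(0.0, np.linalg.eigvals(block).real.max())


# Hypothetical non-Euclidean toy matrix.
toy = np.array([
    [0.0, 2.0, 2.0, 1.1],
    [2.0, 0.0, 2.0, 1.1],
    [2.0, 2.0, 0.0, 1.1],
    [1.1, 1.1, 1.1, 0.0],
])

c1 = lingoes_constant(toy)
c2 = cailliez_constant(toy)

lingoes_d = np.sqrt(toy ** 2 + 2.0 * c1)
np.fill_diagonal(lingoes_d, 0.0)
cailliez_d = toy + c2
np.fill_diagonal(cailliez_d, 0.0)

print(c1, min_gower_eigenvalue(lingoes_d))
print(c2, min_gower_eigenvalue(cailliez_d))
```

Both constants are minimal by construction: they lift the most negative eigenvalue to (numerically) zero and no further. On cost, the Lingoes constant falls out of the same symmetric n x n eigendecomposition PCoA already performs, while this Cailliez sketch needs a dense, non-symmetric 2n x 2n eigenvalue problem, so it is likely the more expensive of the two on large matrices. I have not benchmarked either.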