Transformations prior to Diversity plug-in distance calculations

Hello,

I’m unsure what data transformations are being calculated prior to the distance matrix generation.

In qiime diversity core-metrics-phylogenetic it looks like weighted and unweighted UniFrac distances are calculated along with Bray-Curtis and Jaccard, and then a PCoA of each. It also looks like `FeatureTable[Frequency]` is used as input.

My understanding is that for both Jaccard and unweighted UniFrac, the input feature table would be converted to presence/absence data first. For Bray-Curtis, if a PCoA is being calculated, I’m assuming that the FeatureTable is converted to at least relative abundance data, as Bray-Curtis on count data would violate the triangle inequality (and when Bray-Curtis is calculated on relative abundance data it becomes analogous to the L1 distance?)
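A quick way to see the presence/absence reduction in action, as a sketch with made-up counts using scipy directly rather than the QIIME internals:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Toy feature table: rows = samples, columns = features (raw counts).
counts = np.array([[5, 2, 0],
                   [1, 0, 3]])

# Presence/absence reduction: any non-zero count becomes True.
presence = counts > 0

# scipy's 'jaccard' on boolean vectors: disagreements / features present in either.
d = pdist(presence, metric="jaccard")
print(d)  # [0.66666667]
```

Here the two samples disagree on 2 of the 3 occupied features, so the Jaccard distance is 2/3 regardless of the original count magnitudes.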

For the qiime diversity beta option, possible distance metrics also include the Aitchison (I’m assuming the clr or ilr is calculated on the FeatureTable first), among several others.

I have been performing a custom transformation (a version of the Hellinger) and would like to calculate Jaccard, Bray-Curtis, and weighted and unweighted UniFrac on that data. I re-imported the data into qiime and coerced it into the FeatureTable[Frequency] semantic type, and the command runs fine, but I’m not convinced it’s not doing additional transformations prior to Bray-Curtis and weighted UniFrac (I’m assuming it is converting to presence/absence for Jaccard and unweighted UniFrac).

Sorry this is a bit long-winded. In summary, for commands in the Diversity plug-in that return various distances using FeatureTable[Frequency] as input, are the data transformations required for each distance wrapped into the function? If so, is there a command (or set of commands) that will calculate the distance with no pre-transformation (i.e., on already-transformed input data)?

Thank you

Hi @hsapers,

My recommendation would be for you to explore qiime diversity beta and qiime diversity beta-phylogenetic. These let you calculate distance matrices independently. There are some transforms wrapped into the metrics (unweighted UniFrac and Jaccard do the transform in the distance calculation, and Aitchison wraps the CLR transform). The others take the table you pass.

I believe the current Bray-Curtis implementation wraps the scipy pdist function about 8 levels in, so it may be worth exploring their docs for the exact calculation of the underlying values. It looks like it’s the sum of the absolute differences over the sum of the counts in the base calculation.
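For reference, a minimal check of what scipy’s pdist computes for 'braycurtis' on untransformed counts (toy numbers, not QIIME code):

```python
import numpy as np
from scipy.spatial.distance import pdist

# Two samples of raw counts.
u = np.array([10.0, 0.0, 5.0])
v = np.array([4.0, 6.0, 5.0])

# What scipy computes for 'braycurtis' on the table exactly as passed...
scipy_bc = pdist(np.vstack([u, v]), metric="braycurtis")[0]

# ...matches sum|u_i - v_i| / sum(u_i + v_i), with no hidden transform.
manual_bc = np.abs(u - v).sum() / (u + v).sum()
print(scipy_bc, manual_bc)  # 0.4 0.4
```

So whatever values are in the table you hand it (counts, relative abundances, or a custom transform) are what enter that formula.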

Best,
Justine

Thanks @jwdebelius

Looking at pdist, it looks like the Bray-Curtis distance between samples u and u' is calculated by:
(1) d(u,u') = \frac{\sum_i \mid u_i - u'_i \mid}{\sum_i \mid u_i + u'_i \mid}

My linear algebra isn’t great but I think this is equivalent to:

(2) d(u,u') = \frac{\sum_i \mid u_i - u'_i \mid}{u^+ + u'^+ }
Where u^+ and u'^+ are the row sums for samples u and u'

So if relative abundance was used as input (2) would reduce to:
(3) d(u,u') = \frac{1}{2}\sum_i \mid u_i - u'_i \mid

with (3) being equivalent to half of the L_1 distance (the sum of the absolute differences between two vectors), which can be generalized:
(4) d_{ii'} (p)= (\sum_{j=1}^J \mid x_{ij} - x_{i'j} \mid ^p) ^ \frac{1}{p}
This is the L_p distance: where p=1 it is the Manhattan distance, and where p=2 the Euclidean distance.
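The reduction in (3) is easy to sanity-check numerically with made-up counts and scipy (not QIIME internals):

```python
import numpy as np
from scipy.spatial.distance import pdist

# Toy count table (rows = samples); rows have different totals on purpose.
counts = np.array([[5.0, 0.0, 3.0, 2.0],
                   [1.0, 4.0, 0.0, 5.0],
                   [2.0, 2.0, 2.0, 2.0],
                   [0.0, 7.0, 1.0, 1.0]])

# Convert to relative abundance so every row sums to 1.
rel = counts / counts.sum(axis=1, keepdims=True)

# On relative abundances the denominator of (2) is always 1 + 1 = 2,
# so Bray-Curtis collapses to half the L1 (cityblock/Manhattan) distance.
bc = pdist(rel, metric="braycurtis")
half_l1 = 0.5 * pdist(rel, metric="cityblock")
print(np.allclose(bc, half_l1))  # True
</antml>```

On raw counts with unequal row totals the two quantities differ, which is exactly the point of the derivation above.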

I hope this makes sense. If this is correct, and qiime diversity beta and qiime diversity beta-phylogenetic call Y = pdist(X, 'braycurtis') with no additional transformation wrapped in, wouldn’t calculating the PCoA from the resulting matrix be inadmissible, or at least subject to uninterpretable results, given that the Bray-Curtis dissimilarity is only a true distance if calculated on relative abundance and violates the triangle inequality when calculated on count data?
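The triangle-inequality violation on raw counts is easy to demonstrate with a toy three-sample example (scipy only, made-up data):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three samples of raw counts with unequal totals.
X = np.array([[1.0, 0.0],   # u
              [0.0, 1.0],   # v
              [1.0, 1.0]])  # w

D = squareform(pdist(X, metric="braycurtis"))
d_uv, d_uw, d_wv = D[0, 1], D[0, 2], D[2, 1]

# d(u, v) = 1 but d(u, w) + d(w, v) = 2/3: the triangle inequality fails.
print(d_uv, d_uw + d_wv)
```

On relative abundances this cannot happen, since half the L1 distance is a genuine metric.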

Hi @hsapers,

Thanks for your patience with the response. There’s been a fair bit of discussion under the hood with the mod team’s linear algebra and beta diversity experts (@yoshiki and @Nicholas_Bokulich), and some code checking. To be honest, Nick did most of the legwork here…

The implementation we’ve got is the dissimilarity metric, which, yes, is horrendously confusing and needs to be documented better. But…

That’s why it’s more correctly called a dissimilarity statistic, not a distance metric… it does not satisfy the triangle inequality. So it is not inadmissible, it’s just Bray-Curtis.

We have confirmed that it does not in q2-diversity… so @ChrisKeefe must have been aware of the issue of RA transformation prior to BC dissimilarity, and nipped it in the bud:


I am also not aware that it is inadmissible to use dissimilarities (rather than distances) as input to PCoA… according to Legendre, using non-Euclidean dissimilarity matrices (e.g., with Bray-Curtis) could merely lead to negative eigenvalues, and Legendre & Legendre recommend some procedures to correct negative eigenvalues if they are giving you indigestion. If that’s all Legendre^2 has to say on the matter, then I think that implies that PCoA of Bray-Curtis dissimilarities is not in itself problematic, only that the eigenvalues may need to be adjusted to properly interpret:

The presence of negative eigenvalues in the PCoA solution is the criterion to recognize non-Euclidean dissimilarity matrices… methods are available to eliminate negative eigenvalues produced during PCoA of a non-Euclidean dissimilarity coefficient. - Pierre Legendre

Negative eigenvalues can be produced in PCoA when decomposing distance matrices produced by coefficients that are not Euclidean (Gower and Legendre 1986, Legendre and Legendre 1998). - R Documentation

The scikit-bio developers (@yoshiki) must similarly have been aware of this solution, as they note in their documentation:

If the distance is not euclidean (for example if it is a semimetric and the triangle inequality doesn’t hold), negative eigenvalues can appear. There are different ways to deal with that problem (see Legendre & Legendre 1998, S 9.2.3), but none are currently implemented here. However, a warning is raised whenever negative eigenvalues appear, allowing the user to decide if they can be safely ignored. - skbio documentation

So this has been recognized for a very long time… but this has not stopped anyone (even Legendre) from using Bray-Curtis dissimilarities + PCoA, or suggesting that it is “inadmissible” to do so. But I agree with @yoshiki that better documentation would not be a bad thing. Alternatively, just implement square root transformation or another one of the fixes recommended by Legendre in q2-diversity-lib for all non-metric statistics (e.g., Bray-Curtis).
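For anyone following along, the negative-eigenvalue behaviour is easy to reproduce with a toy example. The `pcoa_eigenvalues` helper below is a hypothetical sketch of the classical Gower double-centering used in PCoA, not skbio’s actual code:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def pcoa_eigenvalues(D):
    """Eigenvalues of the Gower-centered matrix used in classical PCoA."""
    n = D.shape[0]
    A = -0.5 * D ** 2
    J = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    eigs = np.linalg.eigvalsh(J @ A @ J)
    return np.sort(eigs)[::-1]  # largest first

# Bray-Curtis on raw counts for three toy samples (a semi-metric).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.1]])[:3]
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D = squareform(pdist(X, metric="braycurtis"))

print(pcoa_eigenvalues(D))  # approx [0.5, 0.0, -0.0926]: a negative eigenvalue
```

Because this tiny Bray-Curtis matrix is non-Euclidean, one eigenvalue of the centered matrix comes out negative, exactly as Legendre describes.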

@jwdebelius, @yoshiki and @Nicholas_Bokulich - thank you so much for the detailed reply. I’m assuming that

We have confirmed that it does not in q2-diversity… so @ChrisKeefe must have been aware of the issue of RA transformation prior to BC dissimilarity, and nipped it in the bud:

means that q2-diversity does not convert to relative abundance and instead uses whatever values are in the input FeatureTable[Frequency].

I’m still a little confused about using a dissimilarity rather than a distance matrix for PCoA. My understanding is that for PCoA, dissimilarities (distances) between objects are preserved, and with a non-metric input matrix that does not satisfy the triangle inequality, those ‘distances’ as represented in the PCoA may not be relatively true to each other since they don’t represent true distances to begin with (can meaningless, or at least non-intuitive, coordinates in Euclidean space occur without negative eigenvalues? Or are negative eigenvalues a definitive test for whether a non-metric distance matrix can be accurately rendered in Euclidean space without distortion?). I thought that NMDS (relying on rank orders rather than Euclidean distances) was a better choice for representing dissimilarities since there’s no assumed meaning to the absolute ordination position (dissimilarities are not preserved) and the offset between the ordination and the ‘true position’ is measured by the stress (my understanding is that there’s no equivalent to ‘stress’ for PCoA other than the amount of variance explained by the axes chosen to represent the ordination, but there’s no measure of how representative that ordination actually is, because it’s trying to represent a non-metric matrix with a metric coordinate analysis). E.g., I’m taking the 4th root of the relative abundance - so this should preclude negative eigenvalues (I think this is one of the Legendre^2 methods) - does this mean that the resulting PCoA coordination is accurately represented in Euclidean space (the initial D matrix required distortion (√D) for it to satisfy the Euclidean criteria, but the new transformed matrix itself is now metric)?

I actually found this really helpful. Sorry this is a bit long - thinking my way through things. I guess my outstanding question is:

If a semi-metric matrix (e.g., Bray-Curtis) is input into a PCoA, and negative eigenvalues are not produced, is that indicative of that particular Bray-Curtis dissimilarity matrix having a Euclidean property, such that the dissimilarity matrix can be fully represented in a Euclidean space without distortion (or is the correction to prevent negative eigenvalues itself that distortion)? Or is the converse not true: negative eigenvalues can result from distance matrices not accurately represented in Euclidean space, but not all distance matrices that cannot be accurately represented in Euclidean space produce negative eigenvalues?

thanks - and apologies for stirring up discussion…

Correct. cc:ing @ChrisKeefe in case he knows of any transformations that are occurring, but I have searched down the q2-diversity -> skbio -> sklearn -> scipy stack to confirm that none of these are transforming the data.

It might be worth reading Legendre^2’s “Numerical Ecology” to sort out the specifics, but that sounds reasonable. Scikit-bio spits out a warning that may be relevant here for deciding if you need to transform the inputs as described above:

The result contains negative eigenvalues. Please compare their magnitude with the magnitude of some of the largest positive eigenvalues. If the negative ones are smaller, it’s probably safe to ignore them, but if they are large in magnitude, the results won’t be useful. See the Notes section for more details.

I have heard that as well. Unfortunately, QIIME 2 does not have an NMDS method right now (please add one if you want to contribute! It has been on the wish list for a while). But that said, it seems like there is nothing wrong per se with using PCoA on BC dissimilarities, provided the eigenvalues look okay.

From Legendre, it sounds like the (e.g., sqrt) transformation makes BC satisfy the triangle inequality. Without the transformation, it sounds like (from Legendre) it is semi-metric, and (from skbio) you need to look at the eigenvalues to determine just how non-metric the results are, and whether you trust the representation in Euclidean space.
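A toy illustration of the sqrt(D) fix, using the same hypothetical double-centering helper as the standard PCoA construction (this is a sketch, not skbio’s implementation):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def pcoa_eigenvalues(D):
    # Classical PCoA: eigenvalues of the Gower-centered -0.5 * D^2 matrix.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return np.linalg.eigvalsh(J @ (-0.5 * D ** 2) @ J)

# Toy counts for which BC on raw counts violates the triangle inequality.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
D = squareform(pdist(X, metric="braycurtis"))

print(pcoa_eigenvalues(D).min())           # negative on the raw dissimilarity
print(pcoa_eigenvalues(np.sqrt(D)).min())  # ~0 after the sqrt(D) correction
```

Taking the element-wise square root of the dissimilarities removes the negative eigenvalue for this example, consistent with the Legendre^2 recommendation.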

I think another important thing to consider here: PCoA is being used as a soft modeling technique (most of the time) and so this boils down to “can you trust your eyes” in making qualitative interpretations, since usually this is not being used to test for differences… statistical tests for beta diversity differences (e.g., permanova) are not based on the ordination, and come with their own test assumptions etc but are generally compatible with both dissimilarity statistics and distance metrics.

In my reading for the above, I also stumbled across some information relevant to this earlier statement you made:

Sounds like Hellinger transformation prior to PCoA is discouraged:

We recommend not to use PCoA to produce ordinations from the chord, chi-square, abundance profile, or Hellinger distances. It is easier to first transform the community composition data using the following transformations, available in the decostand function of the vegan package, and then carry out a principal component analysis (PCA) on the transformed data. - R documentation

Just something else to consider in case thinking about eigenvalue interpretation off of semi-metric matrices has not been mind-warping enough already :man_shrugging:

Thanks for stirring up discussion… apologies I’m not a real statistician :laughing:

Thanks again for taking the time to go through things - these are the take aways (I think):

  • B-C is a semi-metric, and when used for PCoA, negative eigenvalues can be produced. There are a number of ways of dealing with this, including sqrt(D) - see Legendre^2’s “Numerical Ecology”.
  • In general, NMDS (relying on rank orders rather than Euclidean distances, as PCA, PCoA, and MDS do - hence ‘non-metric multidimensional scaling’) is a better choice for representing dissimilarities like BC that are not metric, since there’s no assumed meaning to the absolute ordination position (dissimilarities are not preserved) and the offset between the ordination and the ‘true position’ is measured by the stress.
  • If PCoA on a semi-metric does not produce negative eigenvalues, it’s probably an OK representation.
  • all ordinations are wrong, some are useful :wink:

Sounds like Hellinger transformation prior to PCoA is discouraged:

I’m trying to sort out in my mind the practical differences between taking the root prior to calculating the dissimilarity matrix (Hellinger) and taking the root of the dissimilarity (distorting the semi-metric to avoid negative eigenvalues).

My interpretation of the R documentation:

We recommend not to use PCoA to produce ordinations from the chord, chi-square, abundance profile, or Hellinger distances. It is easier to first transform the community composition data using the following transformations, available in the decostand function of the vegan package, and then carry out a principal component analysis (PCA) on the transformed data:

Chord transformation: decostand(spiders,“normalize”)

Transformation to relative abundance profiles: decostand(spiders,“total”)

Hellinger transformation: decostand(spiders,“hellinger”)

Chi-square transformation: decostand(spiders,“chi.square”)

The ordination results will be identical and the calculations shorter. This two-step ordination method, called transformation-based PCA (tb-PCA), was described by Legendre and Gallagher (2001).

So if using a ‘Hellinger distance’ (I’m not actually sure what that is in this context - as far as I know Hellinger is just a transformation, √(relabund), and not a distance metric), it’s recommended to perform the Hellinger transformation on the count data and then perform the ordination on the transformed data in two steps - it’s still unclear when the distance/dissimilarity matrix is calculated. So I’m assuming here that there is an option in the R implementation to calculate the distance (which must include a ‘Hellinger distance’, which I think would just wrap the transformation into the distance calculation) wrapped into the ordination function?
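For what it’s worth, the equivalence “Hellinger transformation + Euclidean distance = Hellinger distance” can be sketched in a few lines of numpy/scipy (toy counts; decostand itself is R’s vegan and is not used here):

```python
import numpy as np
from scipy.spatial.distance import pdist

# Toy count table: rows are samples, columns are species.
counts = np.array([[5.0, 0.0, 3.0, 2.0],
                   [1.0, 4.0, 0.0, 5.0]])

# Hellinger transformation: square root of the relative abundances
# (what decostand(x, "hellinger") computes in vegan).
hel = np.sqrt(counts / counts.sum(axis=1, keepdims=True))

# Euclidean distance on the transformed table is the Hellinger distance,
# which is why PCA on the transformed data (tb-PCA) matches PCoA on it.
d = pdist(hel, metric="euclidean")
print(d)
```

Since the resulting distance is genuinely Euclidean, the two-step transform-then-PCA route avoids the negative-eigenvalue issue entirely.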

Just something else to consider in case thinking about eigenvalue interpretation off of semi-metric matrices has not been mind-warping enough already :man_shrugging:

Oh yes - currently using ADONIS permutations for mind-warping…

All sound reasonable, with an amendment:

From the skbio documentation/warning message, it sounds like this is rather a question of whether the negative eigenvalues are too negative… so the magnitude matters. In skimming, I could not find anything in Legendre/Google to support or refute this… so it sounds like the skbio documentation proposes staring at the eigenvalues to decide just how distorted the ordination is?

Yep, sounds like Hellinger transformation --> Euclidean distance == Hellinger distance

Since you put together a nice summary of the conversation above, I will add a summary of open development issues related to the conversation:

  1. implement NMDS in q2-diversity
  2. implement sqrt(D) or other methods to transform semi-metric dissimilarities prior to PCoA.

Have fun ordinating :slight_smile:

I’m mighty late to this party, but thanks for starting an awesome conversation, everyone! I learned some stuff here. And thanks for the link to that Legendre deck @hsapers - it’s full of goodies!

Before we leave this topic behind, @hsapers, I want to make sure I’ve got a clear picture of the situation with Bray-Curtis. We currently have an open issue to allow bray-curtis to use FeatureTable[RelativeFrequency] types. I believe you’re suggesting that BC is problematic when working with Frequency data. I just want to make sure I read you right, and you’re not raising any special concerns about RelativeFrequency data with bray-curtis.

On a more clerical note, here’s the standing NMDS issue, and a new issue to address the need for pre-PCoA transforms of semi-metric dissimilarities. If anyone’s interested in contributing these, we’d love to help you get started.

Thanks all - this has been great

@ChrisKeefe - my current understanding is that if Bray-Curtis is run on untransformed count data (FeatureTable[Frequency]), subsequent metric ordination may produce unintuitive results, including negative eigenvalues and complex numbers. I think (from the algebra above, although I find it interesting that I can’t find this discussed in detail in the literature) if you run Bray-Curtis on relative abundance data it reduces to become analogous to the Manhattan (city-block, L_1, …) distance and actually becomes a true distance.

It would be helpful if there was an option to rarefy or transform prior to beta diversity - perhaps allowing FeatureTable[RelativeFrequency] and/or allowing for an input formula to transform the count data. Not sure if this would require a new class FeatureTable[TransformedFrequency] or FeatureTable[TransformedRelativeFrequency]. Sounds like this may upset some of the other distances, though - for example, it sounds like the CLR is wrapped into the --p-metric aitchison for qiime diversity beta. I think allowing the user to work with transformed data would be pretty powerful. I pull into python and coerce the transformed relative abundance data back into the FeatureTable[Frequency] semantic type. It’s a simple workaround but disrupts provenance.

What I don’t think we closed the loop on was: if a non-metric like the Bray-Curtis dissimilarity is used in a metric ordination such as PCoA, and no negative eigenvalues result, can that then be assumed to be an accurate representation of the dissimilarities in Euclidean space, or are there other, more subtle features diagnostic of trying to represent a non-metric/semi-metric in Euclidean space that may be difficult to detect but would render the ordination non-representative?

I would be interested in helping to address these - but I’m not sure I have enough/the right programming experience.

Awesome! Thanks for confirming.

Options exist to rarefy and transform tables to RelativeFrequency, which can be applied prior to beta-diversity analysis. Maybe this was a typo? It’s been a long and knotty discussion!

I suspect you’re right about this. Some discussion would need to be had about the semantics, but once that’s worked out, something resembling a FeatureTable[TransformedFrequency] type could probably be registered in q2-types.

One of the neat things q2-diversity-lib lets us do is constrain diversity metrics independently. Right now, most diversity measures are still constrained (by default, as it were), to accept only FeatureTable[Frequency] data. The most commonly used measures have their own implementations, generally with more permissive constraints. Setting Bray-Curtis up to take RelativeFrequency (or TransformedFrequency data in the future) mostly consists of adding the relevant types to the Method’s registration, and testing the method properly to ensure the results are as expected. The hardest part (for me at least) is usually understanding the method well enough to know what data it can and can’t use safely.

Shoot me a DM if you’d like to try your hand at adding some code, @hsapers, and we can talk about what kind of support you might need to make that possible. I’d be happy to help out however I can.

Chris :snail:

I agree that adding more transformations, and setting up q2-diversity-lib to handle these as appropriate, is a great idea. I think one thing to work out is whether stand-alone transformations are worth it or if these should just be transformation options wrapped into individual (or all) methods — it is a question of how many metrics each transformation is compatible with, and also whether these transformations would be used with other plugins.

These transformations would be relatively easy to write since they are already mostly available in other packages (skbio, scipy) — a lot of the hard work is figuring out (a) what transformations to include, (b) all the (in)compatibilities, and (c) thinking through the semantic types, etc… so this will take a lot of brain power outside of programming. It would be great to have your help in systematically thinking through this! But if you have any python programming experience (or want to learn) this would also be a great place to start!
