Can anyone suggest which distance metric to use for ranked sets of data with missing values? Screenshot below (small snippet) shows columns containing importance ranks (1 is most important) for a set of features.

I want to see which columns are the most similar, so it is essentially a beta diversity analysis. I read that Kendall's Tau is suggested for rank-based data, which made me curious about why the QIIME2 developers did not include a Kendall's Tau metric for the `qiime diversity beta` action. There are 20 diversity metrics and it was not implemented. So is there a reason it was excluded? Is there another metric that would work for me? This is a developer question in a way and a community question, so any thoughts/stats advice is appreciated.

A follow-up question is how to handle the missing data. Would I impute using the average values, or use 0 or a large number? Zero seems wrong because 1 is the most important feature, but a large number would skew it (so maybe the max value + 1). I'm really unsure about this.


Hello Jennifer,

Good to see you again.

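I checked Wikipedia for Kendall's Tau, and this page compares it to Spearman's ρ. I think that's available through `--p-metric correlation`: the correlation distance is based on Pearson's r, which is effectively Spearman's ρ when the inputs are already ranks.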

Why? I'm not sure, but I think it's because, in numerical ecology, beta diversity is usually framed as a question of distance (how 'close' are these samples).

Beta diversity is framed as a geometric measure instead of a statistical measure.

This geometric analogy also appears in the direction of scales: 0 means close while 1 means far.

(Compare to a correlation, where 1 means similar and 0 means different.)

Under the hood in SciPy, it's in `scipy.spatial.distance`: see "Distance computations (scipy.spatial.distance)" in the SciPy v1.11.3 Manual.

Kendall's Tau is in stats: see `scipy.stats.kendalltau` in the SciPy v1.11.3 Manual.
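To make the distance-versus-correlation point concrete, here is a minimal sketch with made-up rank vectors (not the data from the question). It assumes SciPy's `correlation` distance is what sits behind the metric, and it uses `1 - tau` as one common way to turn Kendall's Tau into a dissimilarity; that conversion is my assumption, not something QIIME 2 provides.

```python
import numpy as np
from scipy.spatial.distance import correlation
from scipy.stats import kendalltau, spearmanr

# Hypothetical importance ranks for the same five features in two columns.
col_a = np.array([1, 2, 3, 4, 5], dtype=float)
col_b = np.array([2, 1, 3, 5, 4], dtype=float)

# Correlation distance is 1 - Pearson r, so 0 means "same ordering".
corr_dist = correlation(col_a, col_b)

# Spearman's rho is Pearson r computed on ranks, so on rank data the two agree.
rho, _ = spearmanr(col_a, col_b)

# Kendall's tau is a correlation in [-1, 1]; 1 - tau flips it so that
# 0 means "same ordering" and larger values mean "more different".
tau, _ = kendalltau(col_a, col_b)
tau_dist = 1.0 - tau

print(f"correlation distance: {corr_dist:.3f}")
print(f"1 - Spearman rho:     {1.0 - rho:.3f}")
print(f"1 - Kendall tau:      {tau_dist:.3f}")
```

The first two numbers match, which is why the `correlation` metric behaves like Spearman on columns that are already ranks. The Kendall-based value is on a different scale, so the two are not interchangeable even though they usually agree on which columns are most alike.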

If you can find out more about this history, I would love to learn more!

As for how to handle the missing data: that is complicated (and also controversial!!)
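One option that avoids choosing a fill value at all is to compare each pair of columns only on the features that both of them ranked. This is just a sketch of that pairwise-deletion idea, not a recommendation: `pairwise_kendall` is a hypothetical helper, the ranks are made up, and it assumes missing ranks are stored as NaN.

```python
import numpy as np
from scipy.stats import kendalltau

def pairwise_kendall(x, y):
    """Kendall's tau using only features ranked in both columns (NaN = missing)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    shared = ~np.isnan(x) & ~np.isnan(y)
    if shared.sum() < 2:
        # Fewer than two shared features: no ordering to compare.
        return np.nan
    tau, _ = kendalltau(x[shared], y[shared])
    return tau

# Hypothetical columns of importance ranks with gaps.
col_a = [1, 2, np.nan, 3, 4]
col_b = [2, 1, 3, np.nan, 4]

tau = pairwise_kendall(col_a, col_b)
print(f"Kendall tau on shared features: {tau:.3f}")
```

The trade-off is that this throws away the information that a feature was unranked in one column, and every pair of columns ends up compared on a different subset of features. Filling with max value + 1, as suggested in the question, keeps that information but creates ties among all the filled-in features.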


Hi again! Thanks for the help!

So funny/unfortunate that the name Kendall's Tau confused me so much. I did not read or see that it was a rank correlation comparable to Spearman's, so I thought it had other properties. At least I learned a bit more about distance metrics.

Usually I use Spearman when I'm adjusting original values, so it sort of felt more complicated when starting with a set of ranks. I guess I overcomplicated it. Correlation worked. Thank you!

For documentation purposes I'll continue to look into it a bit and report back. For anyone who might see this post, this chapter covers the difference between Pearson, Spearman and Kendall: "Chapter 22: Correlation Types and When to Use Them".
