# Distance metric wanted for ranked sets with missing values (beta diversity metric)

Can anyone suggest which distance metric to use for ranked sets of data with missing values? Screenshot below (small snippet) shows columns containing importance ranks (1 is most important) for a set of features.

I want to see which columns are the most similar, so it is essentially a beta diversity analysis. I read that Kendall's tau is suggested for rank-based data, which made me curious: why did the QIIME 2 developers not include Kendall's tau as a metric for the `qiime diversity beta` action? There are 20 diversity metrics, and it is not among them. Is there a reason it was excluded? Is there another metric that would work for me? This is partly a developer question and partly a community question, so any thoughts or stats advice are appreciated.

A follow-up question: how should I handle the missing data? Would I impute using the average value, or use 0, or a large number? Zero seems wrong because 1 is the most important rank, but a large number would skew the result (so maybe the max value + 1). I'm really unsure about this.

Hello Jennifer,

Good to see you again.

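I checked Wikipedia for Kendall's tau, and the page compares it to Spearman's ρ. I think Spearman's ρ is what you effectively get from `--p-metric correlation`: when the data are already ranks, a Pearson-based correlation distance amounts to one minus Spearman's ρ.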

Why wasn't Kendall's tau included? I'm not sure, but I think it's because, in numerical ecology, beta diversity is usually framed as a question of distance (how 'close' are these samples).

Beta diversity is framed as a geometric measure

This geometric analogy also appears in the direction of scales: 0 means close while 1 means far.
(Compare to a correlation, where 1 means similar and 0 means unrelated.)

Under the hood in SciPy, the correlation distance is in `scipy.spatial.distance`: see "Distance computations (scipy.spatial.distance)" in the SciPy v1.11.3 Manual.
Kendall's tau is in `scipy.stats`: see "scipy.stats.kendalltau" in the SciPy v1.11.3 Manual.
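
To make the comparison concrete, here is a minimal sketch of both options in SciPy. The `ranks` array is made-up toy data (one row per column being compared), and `kendall_distance` is a hypothetical helper that turns tau into a distance as 1 - tau; neither comes from QIIME 2 itself.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import kendalltau, spearmanr

# Hypothetical toy data: each row is one column of the rank table
# (1 = most important feature), with no missing values and no ties.
ranks = np.array([
    [1, 2, 3, 4, 5],
    [2, 1, 3, 5, 4],
    [5, 4, 3, 2, 1],
], dtype=float)

# SciPy's correlation distance is 1 - Pearson r. On data that are
# already ranks, Pearson r is the same number as Spearman's rho.
corr_dist = squareform(pdist(ranks, metric="correlation"))
rho, _ = spearmanr(ranks[0], ranks[1])

# Hypothetical helper: convert Kendall's tau into a distance as 1 - tau.
def kendall_distance(u, v):
    tau, _ = kendalltau(u, v)
    return 1.0 - tau

# pdist accepts a callable, so the same pairwise machinery can be reused.
kendall_dist = squareform(pdist(ranks, metric=kendall_distance))

print(corr_dist.round(3))
print(kendall_dist.round(3))
```

Note that both of these distances run from 0 (identical ordering) to 2 (exactly reversed ordering), not 0 to 1, because the underlying correlations run from -1 to 1.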

Handling the missing data is complicated (and also controversial!!)
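
As a rough illustration of why the choice matters, the sketch below compares two ways of treating missing ranks: dropping any feature that is missing in either column, and filling gaps with the max value + 1 idea from the question. The arrays and both helper functions are hypothetical, and this is not a recommendation of either approach.

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical ranks; np.nan marks a feature that was not ranked.
a = np.array([1, 2, 3, np.nan, 4])
b = np.array([2, 1, np.nan, 3, 4])

def tau_pairwise_complete(x, y):
    """Kendall's tau using only features ranked in both columns."""
    keep = ~(np.isnan(x) | np.isnan(y))
    tau, _ = kendalltau(x[keep], y[keep])
    return tau

def tau_worst_rank(x, y):
    """Kendall's tau after filling gaps with (max observed rank + 1)."""
    x_filled = np.where(np.isnan(x), np.nanmax(x) + 1, x)
    y_filled = np.where(np.isnan(y), np.nanmax(y) + 1, y)
    tau, _ = kendalltau(x_filled, y_filled)
    return tau

print(tau_pairwise_complete(a, b))
print(tau_worst_rank(a, b))
```

On this toy input the two strategies give noticeably different tau values, so whichever one is used should be stated alongside the results.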

Hi again! Thanks for the help!

So funny/unfortunate that the name Kendall's tau confused me so much. I did not read or see that it is a rank correlation closely related to Spearman's, so I thought it had other properties. At least I learned a bit more about distance metrics.

Usually I use Spearman when I'm adjusting original values, so it felt more complicated when starting with a set of ranks. I guess I overcomplicated it. Correlation worked. Thank you!

For documentation purposes I'll continue to look into it a bit and report back. For anyone who might see this post, this chapter covers the difference between Pearson, Spearman and Kendall: "Chapter 22: Correlation Types and When to Use Them".
