Can anyone suggest which distance metric to use for ranked sets of data with missing values? Screenshot below (small snippet) shows columns containing importance ranks (1 is most important) for a set of features.
I want to see which columns are the most similar, so it is essentially a beta diversity analysis. I read that Kendall's tau is suggested for rank-based data, which made me curious why the QIIME 2 developers did not include a Kendall's tau metric in the qiime diversity beta action. There are 20 diversity metrics and it was not implemented, so is there a reason it was excluded? Is there another metric that would work for me? This is partly a developer question and partly a community question, so any thoughts or stats advice are appreciated.
A follow-up question is how to handle the missing data. Would I impute using the average values, or use 0 or a large number? Zero seems wrong because 1 is the most important rank, but a large number would skew it (so maybe the max value + 1). I'm really unsure about this.
Hello Jennifer,
Good to see you again.
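I checked Wikipedia for Kendall's tau, and this page compares it to Spearman's ρ. I think that's available as --p-metric correlation (as far as I can tell, that metric is Pearson-based, but on data that are already ranks it corresponds to Spearman's ρ).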
Why? I'm not sure, but I think it's because, in numerical ecology, beta diversity is usually framed as a question of distance (how 'close' are these samples).
Beta diversity is framed as a geometric measure instead of a statistical measure.
This geometric analogy also appears in the direction of scales: 0 means close while 1 means far.
(Compare to a correlation, where 1 means similar and 0 means different.)
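To make the direction of the scales concrete, here is a minimal sketch comparing a correlation with SciPy's correlation distance. The three vectors are made-up ranks, not data from this thread. One caveat I'm adding: as far as I know, the correlation distance is 1 minus the Pearson correlation, so it can reach 2 for perfectly reversed orderings rather than stopping at 1.

```python
import numpy as np
from scipy.spatial.distance import correlation

# Made-up rank vectors for illustration only.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([1.0, 3.0, 2.0, 5.0, 4.0])  # ordering similar to a
c = a[::-1]                              # ordering exactly reversed

# Correlation: larger means more similar.
r_ab = np.corrcoef(a, b)[0, 1]

# Correlation distance: smaller means more similar (0 = identical ordering).
d_aa = correlation(a, a)
d_ab = correlation(a, b)
d_ac = correlation(a, c)

print(f"r(a, b) = {r_ab:.2f}")
print(f"d(a, a) = {d_aa:.2f}, d(a, b) = {d_ab:.2f}, d(a, c) = {d_ac:.2f}")
```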
Under the hood in SciPy, it's in spatial.distance: see "Distance computations (scipy.spatial.distance)" in the SciPy v1.12.0 Manual.
Kendall's tau is in stats: see "scipy.stats.kendalltau" in the SciPy v1.12.0 Manual.
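If you want to try both outside of QIIME 2, here is a rough sketch of how the two SciPy modules could be combined. Everything specific in it is an assumption for illustration: the `ranks` table is made up (rows are features, columns are the rank lists being compared), `kendall_distance` is a hypothetical helper, and turning tau into a distance as 1 - tau is only one possible convention. It also assumes there are no missing values.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import kendalltau, spearmanr

# Made-up rank table: rows = features, columns = rank lists to compare.
ranks = np.array([
    [1, 2, 1],
    [2, 1, 3],
    [3, 4, 2],
    [4, 3, 5],
    [5, 5, 4],
], dtype=float)

def kendall_distance(u, v):
    """Hypothetical helper: 1 - Kendall's tau, so 0 means identical ordering."""
    tau, _ = kendalltau(u, v)
    return 1.0 - tau

# pdist compares rows, so transpose to compare columns.
corr_dm = squareform(pdist(ranks.T, metric="correlation"))
kendall_dm = squareform(pdist(ranks.T, metric=kendall_distance))

# On complete ranks, correlation distance matches 1 - Spearman's rho.
rho_01, _ = spearmanr(ranks[:, 0], ranks[:, 1])
print("correlation distance, columns 0 and 1:", corr_dm[0, 1])
print("1 - Spearman's rho, columns 0 and 1:  ", 1.0 - rho_01)
print("Kendall distance, columns 0 and 1:    ", kendall_dm[0, 1])
```

Passing a Python callable to `pdist` is slow on large tables, but it keeps the result in the same distance-matrix shape as the built-in metrics.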
If you can find out more about this history, I would love to learn more!
As for how to handle the missing data: that is complicated (and also controversial!!)
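I can't tell you which option is right, but here is a small sketch for seeing how much the choice matters on your own data. It compares two of the options: dropping features that are missing in either column, and filling missing ranks with the max observed rank + 1 as suggested in the question. The helper names and the two example vectors are hypothetical, and neither approach is meant as a recommendation.

```python
import numpy as np
from scipy.stats import kendalltau

def tau_pairwise_complete(u, v):
    """Kendall's tau using only features ranked in both columns."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    keep = ~(np.isnan(u) | np.isnan(v))
    if keep.sum() < 2:
        return np.nan  # not enough shared features to compare
    tau, _ = kendalltau(u[keep], v[keep])
    return tau

def tau_impute_worst(u, v):
    """Kendall's tau after filling missing ranks with (max observed rank + 1)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    u_filled = np.where(np.isnan(u), np.nanmax(u) + 1, u)
    v_filled = np.where(np.isnan(v), np.nanmax(v) + 1, v)
    tau, _ = kendalltau(u_filled, v_filled)
    return tau

# Made-up rank columns with one missing value each.
x = [1, 2, 3, np.nan, 4]
y = [2, 1, np.nan, 3, 4]

tau_pc = tau_pairwise_complete(x, y)
tau_imp = tau_impute_worst(x, y)
print(f"pairwise complete: {tau_pc:.3f}, impute max + 1: {tau_imp:.3f}")
```

When a column has several missing values, the max + 1 fill gives them all the same rank, so the result then depends on how ties are handled.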
Hi again! Thanks for the help!
So funny/unfortunate that the name Kendall's tau confused me so much. I had not read or seen that it was comparable to Spearman, so I thought it had other properties. At least I learned a bit more about distance metrics.
Usually I use Spearman when I'm adjusting original values, so it felt more complicated when starting with a set of ranks. I guess I overcomplicated it. Correlation worked. Thank you!
For documentation purposes I'll continue to look into it a bit and report back. For anyone who might see this post, this page discusses the difference between Pearson, Spearman and Kendall: https://ademos.people.uic.edu/Chapter22.html