jaccard vs bray_curtis vs unifrac

devonorourke · May 13, 2019, 9:02pm

Thanks for adding another layer of complexity @jwdebelius - it's important to point out that even when you think you're using a metric as might be described in the literature, that's not necessarily what's under the (Python) hood!

A bit of Googling led me to this old post where the issue was raised about some diversity measures not running their qualitative assumptions by default. I didn't realize that Jaccard was assumed to be qualitative! That made me realize that we've all had this discussion before - back in 2018. I'd like to suggest one amendment to the document you've highlighted previously: state which metrics the QIIME2 implementation assumes to be qualitative, and which quantitative, by default.
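To make the qualitative/quantitative distinction concrete, here's a minimal pure-Python sketch. It is not the QIIME2 or SciPy code; the function names and the toy samples are made up for illustration. It assumes "qualitative Jaccard" means converting counts to presence-absence before comparing, and it uses the standard Bray-Curtis formula on the raw counts for contrast.

```python
def to_presence_absence(counts):
    """Convert a vector of counts to 1 (present) / 0 (absent)."""
    return [1 if c > 0 else 0 for c in counts]


def jaccard_distance(u, v):
    """Qualitative Jaccard: 1 - (shared features / features in either sample)."""
    u_bin = to_presence_absence(u)
    v_bin = to_presence_absence(v)
    shared = sum(1 for a, b in zip(u_bin, v_bin) if a and b)
    union = sum(1 for a, b in zip(u_bin, v_bin) if a or b)
    # Two empty samples are treated as identical (an assumption of this sketch).
    return 0.0 if union == 0 else 1 - shared / union


def bray_curtis(u, v):
    """Quantitative Bray-Curtis: sum|u - v| / sum(u + v) on raw counts."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return 0.0 if denominator == 0 else numerator / denominator


# Hypothetical samples: A and B contain the same three features,
# but in very different proportions. C shares only one feature with A.
sample_a = [100, 1, 1, 0]
sample_b = [1, 100, 1, 0]
sample_c = [50, 0, 0, 50]

jaccard_ab = jaccard_distance(sample_a, sample_b)
bray_ab = bray_curtis(sample_a, sample_b)
jaccard_ac = jaccard_distance(sample_a, sample_c)

print(f"Jaccard A vs B:     {jaccard_ab:.3f}")  # 0.000
print(f"Bray-Curtis A vs B: {bray_ab:.3f}")     # 0.971
print(f"Jaccard A vs C:     {jaccard_ac:.3f}")  # 0.750
```

Samples A and B look identical to the presence-absence version of Jaccard, yet they're nearly as different as they can be under Bray-Curtis. That's exactly the kind of surprise a user gets if they don't know a binary transformation is happening.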

I'm all for the user needing to understand how each metric works, but I think the Jaccard distance illustrates a need for a minor change in how this QIIME2 documentation works. Adding a --binary ... parameter to the function would ensure that the user knows whether or not a metric assumes a binary transformation, but that would likely change a lot of the existing code, so perhaps that's something for another day.

I'd love it if the developers amended the documentation in the help menu of qiime diversity beta to point out which metrics are treated as qualitative and which as quantitative. It's already outlined in Greg's issue I linked above, so it'd just be a matter of telling the user which metrics perform a binary transformation by default. Something like:

Parameters:
  --p-metric TEXT Choices('cosine', 'mahalanobis', 'canberra_adkins',
    'yule', 'sokalsneath', 'hamming', 'seuclidean', 'sokalmichener',
    'russellrao', 'rogerstanimoto', 'chebyshev', 'euclidean', 'canberra',
    'sqeuclidean', 'cityblock', 'dice', 'correlation', 'kulsinski',
    'matching', 'wminkowski', 'braycurtis', 'jaccard', 'aitchison')
                       The beta diversity metric to be computed.    [required]
 
## Caution: the following metrics transform the abundance table into binary (presence-absence) format by default:
sokalsneath, yule, jaccard, sokalmichener, kulsinski, rogerstanimoto, dice, matching, russellrao

Seem sensible?