jaccard vs bray_curtis vs unifrac

rachel_haupt · May 12, 2019, 11:24am

Hello.
I have recently started working with QIIME 2, and I saw that the core metrics command gives me several different analyses, including Jaccard, Bray-Curtis, and UniFrac.
After some reading I understand what each analysis does and how.
But I don't understand when I should prefer one analysis over another.
I'd be grateful for any help.
Thank you

jwdebelius · May 12, 2019, 11:31am

Hi @rachel_haupt,

Welcome! I moved this over to general discussion because I think it's probably a better fit for your question.

I like to think of different metrics like different lenses in a camera or microscope. Different lenses will show me different things about my data. So, since Jaccard distance is an unweighted taxonomic metric (it doesn't consider abundance or phylogeny), it simply tells me whether or not features are shared. Bray-Curtis is similar, but it considers abundance. I tend to like to use Bray-Curtis and weighted metrics when I'm worried about the most abundant things in a community. I like unweighted (Jaccard) when I want to give equal weight to rare and abundant organisms. But my personal preference is to look at multiple metrics so I can get a broader view of my data. Maybe I didn't have a clear hypothesis a priori, but when I check Bray-Curtis distance, I find a difference I don't see in another metric. That tells me abundant organisms are driving my difference, and then I can test hypotheses from there.
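To make the "different lenses" idea concrete, here is a minimal sketch using SciPy's `jaccard` and `braycurtis` functions on made-up counts. The three samples and the `jaccard_pa` helper are hypothetical and only for illustration; this is not QIIME 2's own code.

```python
import numpy as np
from scipy.spatial.distance import braycurtis, jaccard

# Hypothetical feature counts (one value per feature) for three samples.
sample_a = np.array([90, 5, 5, 0])
sample_b = np.array([5, 90, 5, 0])  # same features as A, different dominant feature
sample_c = np.array([90, 5, 0, 5])  # same dominant feature as A, one rare feature swapped


def jaccard_pa(u, v):
    """Jaccard distance on presence/absence (a feature is present if its count > 0)."""
    return jaccard(u > 0, v > 0)


# A vs B: identical membership, very different abundances.
print(f"A vs B: Jaccard {jaccard_pa(sample_a, sample_b):.2f}, "
      f"Bray-Curtis {braycurtis(sample_a, sample_b):.2f}")  # Jaccard 0.00, Bray-Curtis 0.85

# A vs C: nearly identical abundances, but a rare feature differs.
print(f"A vs C: Jaccard {jaccard_pa(sample_a, sample_c):.2f}, "
      f"Bray-Curtis {braycurtis(sample_a, sample_c):.2f}")  # Jaccard 0.50, Bray-Curtis 0.05
```

In this toy case the two metrics rank the pairs in opposite order: the presence/absence view sees only the rare-feature swap, while Bray-Curtis sees only the shift in the dominant feature.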

I also want to link this back to an amazing post about all the metrics because I seriously use this one ALL the time and it's great.

Best,
Justine

devonorourke · May 13, 2019, 7:36pm

@rachel_haupt if I had 10 cents for every diversity metric I've stumbled across, I'd have ... a lot of dimes. While I found the link that @jwdebelius recommended very useful, that post doesn't illustrate all the different equations each metric is using in one place. When I was getting started with diversity measures this made it challenging to understand which metrics are more similar to each other, and what assumptions I was making in adopting that metric.

For instance, the Jaccard index is not unweighted, nor is the Bray-Curtis metric, so both can be used with abundance data. They are similar in spirit but differ in that one index (Bray-Curtis) weights the abundances of shared species more. To complicate things, both can work on (unweighted) presence-absence data. And if you want to run the Bray-Curtis values on unweighted data, you're technically running a Dice-Sørensen index.
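The last point can be checked numerically. The sketch below assumes SciPy's `braycurtis` and `dice` functions and uses made-up counts; the variable names are hypothetical. It binarizes two samples and compares Bray-Curtis on the 0/1 vectors with the Dice dissimilarity (one minus the Sørensen-Dice coefficient).

```python
import numpy as np
from scipy.spatial.distance import braycurtis, dice

# Hypothetical feature counts for two samples.
counts_a = np.array([12, 0, 3, 7, 0])
counts_b = np.array([0, 4, 9, 1, 0])

# Presence/absence transformation: a feature is present if its count > 0.
present_a = counts_a > 0
present_b = counts_b > 0

# Bray-Curtis computed on the 0/1 vectors...
bc_binary = braycurtis(present_a.astype(int), present_b.astype(int))
# ...versus the Dice dissimilarity on the boolean vectors.
dice_dist = dice(present_a, present_b)

print(bc_binary, dice_dist)
print(np.isclose(bc_binary, dice_dist))
```

Both values come out to 2/6 here: two features are present in only one sample, and two are shared (each shared feature counts twice in the denominator).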

I've found Anne Chao's writings really clear - this paper helped me a lot in thinking about the appropriateness of a range of beta diversity metrics. You might also find this article helpful insofar as it is quite comprehensive in thinking about all the ways you could evaluate how any two metrics are similar or different from one another.

jwdebelius · May 13, 2019, 8:29pm

Thank you @devonorourke for the awesome papers!

I think it's worth noting, though, that the SciPy (QIIME) implementation of Jaccard is unweighted (insofar as I understand unweighted as considering presence and absence), in that it takes a boolean matrix and does a variant on this calculation:

d_{a,b} = 1 - \frac{|A \cap B|}{|A \cup B|}
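As a rough check of that set-based calculation, the sketch below computes it by hand with Python sets and compares it with SciPy's `jaccard` on boolean vectors. The feature IDs and counts are made up, and converting with `counts > 0` is an assumption about how the boolean matrix would be derived, not a description of QIIME 2's internals.

```python
import numpy as np
from scipy.spatial.distance import jaccard

# Hypothetical feature IDs and counts for two samples.
feature_ids = np.array(["f1", "f2", "f3", "f4", "f5"])
counts_a = np.array([10, 0, 3, 1, 0])
counts_b = np.array([0, 0, 50, 2, 8])

# Sets of features observed in each sample (A and B in the formula).
set_a = set(feature_ids[counts_a > 0])
set_b = set(feature_ids[counts_b > 0])

# 1 - |A intersect B| / |A union B|, computed directly from the sets.
d_sets = 1 - len(set_a & set_b) / len(set_a | set_b)

# The same distance from SciPy, given boolean presence/absence vectors.
d_scipy = jaccard(counts_a > 0, counts_b > 0)

print(d_sets, d_scipy)  # 0.5 0.5
```

Note that the large count difference on `f3` (3 vs 50) has no effect on either value, which is the sense in which this form of Jaccard is qualitative.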

devonorourke · May 13, 2019, 9:02pm

Thanks for adding another layer of complexity @jwdebelius - it's important to point out that even when you think you're using a metric as might be described in the literature, that's not necessarily what's under the (Python) hood!

A bit of Googling led me to this old post where the issue was raised about some diversity measures not running their qualitative assumptions by default. I didn't realize that Jaccard was assumed to be qualitative! That made me realize that we've all had this discussion before - back in 2018. I'd like to suggest one amendment to the document you've highlighted previously: a note on which metrics the QIIME 2 implementation assumes to be qualitative, and which quantitative, by default.

I'm all for the user needing to understand how each metric works, but I think the Jaccard distance illustrates a need for a minor change in how this QIIME 2 documentation works. Adding a --binary ... parameter to the function would ensure that the user knows whether or not a metric is assuming a binary transformation, but that would likely change a lot of what's going on with the existing code, and perhaps that's something for another day.

I'd love it if the developers amended the documentation in the help menu of qiime diversity beta to point out which metrics are treated as qualitative or quantitative. It's already outlined in Greg's issue I linked above, so it'd just be a matter of telling the user which metrics perform a binary transformation by default. Something like:

Parameters:
  --p-metric TEXT Choices('cosine', 'mahalanobis', 'canberra_adkins',
    'yule', 'sokalsneath', 'hamming', 'seuclidean', 'sokalmichener',
    'russellrao', 'rogerstanimoto', 'chebyshev', 'euclidean', 'canberra',
    'sqeuclidean', 'cityblock', 'dice', 'correlation', 'kulsinski',
    'matching', 'wminkowski', 'braycurtis', 'jaccard', 'aitchison')
                       The beta diversity metric to be computed.    [required]
 
## Caution: the following metrics transform the abundance table into binary (presence-absence) format by default:
sokalsneath, yule, jaccard, sokalmichener, kulsinski, rogerstanimoto, dice, matching, russellrao

Seem sensible?