Jaccard distance vs Jaccard similarity

kam · August 21, 2023, 10:13am

Hello,

I've noticed several forum posts discussing whether Jaccard values should be read as distances or as similarities, and it still appears somewhat confusing. The interpretation of Jaccard distance in R's vegan package also seems problematic, but for a different reason (see "how many people have computed jaccard distances incorrectly using vegdist? · Issue #153 · vegandevs/vegan · GitHub").

For instance, this earlier post ("jaccard vs bray_curtis vs unifrac - #3 by devonorourke") suggests that Jaccard distance in QIIME2 is calculated as a dissimilarity metric (a higher value indicates less similarity between samples), while this post ("beta diversity explanation (jaccard_distance)") suggests the opposite.

So, I have two questions regarding this matter:

1. What is the implementation of Jaccard in QIIME2, and is it consistent across all functions that use Jaccard, such as diversity, core metrics, and diversity lib?
2. More broadly, do all the distance metrics in the diversity plugin (as well as the UniFrac methods) represent dissimilarity (where a higher value indicates less similarity between samples)?

devonorourke · August 21, 2023, 12:45pm

Hi @kam ,
I'm not certain, so hopefully an admin can confirm, but I believe non-phylogenetic beta diversity distances are currently calculated with scikit-learn's sklearn.metrics.pairwise_distances function. For example, the q2-diversity function that creates a distance matrix using the Jaccard method passes pairwise_func=sklearn.metrics.pairwise_distances:

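import biom
import skbio.diversity
import sklearn.metrics


def jaccard(table: biom.Table, n_jobs: int = 1) -> skbio.DistanceMatrix:
    # Densify the feature-by-sample table and transpose it to sample-by-feature.
    counts = table.matrix_data.toarray().T
    sample_ids = table.ids(axis='sample')
    return skbio.diversity.beta_diversity(
        metric='jaccard',
        counts=counts,
        ids=sample_ids,
        validate=True,
        # Pairwise distances are delegated to scikit-learn.
        pairwise_func=sklearn.metrics.pairwise_distances,
        n_jobs=n_jobs
    )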

If I'm interpreting the sklearn function correctly, I believe your interpretation is correct: higher individual values indicate greater distances, and therefore greater dissimilarities, between samples.
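
One way to check this directly is to call sklearn.metrics.pairwise_distances on a tiny presence/absence table. This is only a minimal sketch: the four samples below are made up for illustration, the data are already boolean (I'm not showing how QIIME2 or skbio prepare the counts), and the `similarity` variable is just 1 minus the distance, not something either library returns.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical presence/absence table: rows are samples, columns are features.
presence = np.array([
    [1, 1, 1, 0],  # sample A
    [1, 1, 1, 0],  # sample B: identical to A
    [1, 1, 0, 1],  # sample C: shares 2 of the 4 features seen in A or C
    [0, 0, 0, 1],  # sample D: shares no features with A
], dtype=bool)

# Jaccard distance = 1 - |intersection| / |union| of the features present.
dist = pairwise_distances(presence, metric="jaccard")

# Jaccard similarity is the complement of the distance.
similarity = 1.0 - dist

print("A vs B distance:", dist[0, 1])  # identical samples
print("A vs C distance:", dist[0, 2])  # partially overlapping samples
print("A vs D distance:", dist[0, 3])  # completely disjoint samples
```

Identical samples come out at 0 and samples with no shared features at 1, which is what you would expect from a dissimilarity rather than a similarity.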

I believe the same Jaccard call is used across the QIIME2 platform, but I don't know whether every non-phylogenetic distance metric is obtained from Scikit Learn (and that may not even be a realistic expectation if additional distance measures are implemented outside of the Scikit Learn library). What I can tell you is that, at least as a start, I'd look at the beta diversity calculations in the beta.py script of the q2-diversity-lib QIIME2 repository.
