beta diversity explanation (jaccard_distance)

terren · June 3, 2020, 5:09am

Hi, I ran a beta diversity analysis using the jaccard_distance metric, but I am confused by the result.
According to the table, there is a significant difference between the NF and PF groups, and the distance from NF to PF is 1 in the upper left graph, so these are consistent. However, the distance from NL to PL is also 1 in the lower right graph, which would mean the two groups share no species, yet the p-value in the table is above 0.05, which means there is no significant difference between NL and PL.

Mehrbod_Estaki · June 3, 2020, 9:47pm

Hi @terren,
The Jaccard distance tells you how similar 2 samples are to each other based on the presence/absence of taxa. So a value of 1 is telling you that 100% of the taxa you found in one sample (or group in this case) were found in the other.
The p-values you are looking at are unadjusted; you'll want to look at the q-values instead, which are the p-values adjusted for multiple testing. Those suggest that there are no differences between your groups, which also supports what the plots are showing.
Hope that clarifies it a bit.

terren · June 3, 2020, 11:04pm

I am confused. I took the meaning of Jaccard distance from Metagenomics - x:
Jaccard distance

based on presence or absence of species (does not include abundance information)
difference in microbial composition between two samples
0 means both samples share exactly the same species
1 means both samples have no species in common

Is it wrong?

Mehrbod_Estaki · June 3, 2020, 11:29pm

Hi @terren,
It depends on the implementation of the scoring. Some tools use the score as is, some use a 1-x approach. Let's see what QIIME 2 does, using 2 identical matrices and calling jaccard_score from scikit-learn:

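import numpy as np
from sklearn.metrics import jaccard_score

# Two identical presence/absence tables (rows = samples, columns = taxa)
x = np.array([[1, 1, 1],
              [1, 1, 1]])
y = np.array([[1, 1, 1],
              [1, 1, 1]])

# Compare the first sample of each table
jaccard_score(x[0], y[0])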

1.0

So, as you can see a score of 1 means the 2 tables are identical.

I know it can be confusing, and I think it would be ideal for tools to explain their implementation in the documentation (not just QIIME 2; I find this to be an issue with most packages I use in R as well).
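To make the "score as is" versus "1-x" distinction concrete, here is a minimal sketch in plain Python. The function names and the taxon labels are hypothetical, chosen only for illustration. scikit-learn's `jaccard_score` computes the similarity form, while a distance function such as `scipy.spatial.distance.jaccard` returns the dissimilarity form (1 minus the similarity), so it is worth confirming which of the two a given tool's plot or table reports.

```python
def jaccard_similarity(a, b):
    """Shared taxa divided by all taxa observed in either sample."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty samples count as identical.
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """The '1-x' form: 0 means identical, 1 means nothing shared."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical samples described by the taxa present in each
sample_1 = {"taxon_A", "taxon_B", "taxon_C"}
sample_2 = {"taxon_A", "taxon_B", "taxon_C"}  # same taxa as sample_1
sample_3 = {"taxon_D", "taxon_E"}             # shares nothing with sample_1

print(jaccard_similarity(sample_1, sample_2), jaccard_distance(sample_1, sample_2))
print(jaccard_similarity(sample_1, sample_3), jaccard_distance(sample_1, sample_3))
```

With identical samples the similarity is 1 and the distance is 0; with no shared taxa it is the other way around.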

terren · June 4, 2020, 4:10am

Thanks for your quick reply. But I am still puzzled by the following p-values and q-values. Which threshold should be used as the criterion for a significant difference?

Could you also explain what the distances mean for bray_curtis_distance, weighted_unifrac_distance, and unweighted_unifrac_distance? I am confused by the three following graphs. Do the groups differ significantly in each graph?

Mehrbod_Estaki · June 4, 2020, 4:25am

Hi @terren,
Those are simply the p-values and the adjusted p-values (q-values). Common convention sets the cutoff at p < 0.05, so while your groups may appear to differ based on the unadjusted tests, this is not the case once you (appropriately) account for multiple testing (q-value). I would stick with the q-value and conclude that you cannot reject the null hypothesis.
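As a rough illustration of how a p-value below 0.05 can end up with a q-value above it, here is a minimal sketch of a Benjamini-Hochberg false discovery rate adjustment. That this is the correction behind the q-values in your table is an assumption on my part, and the p-values below are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each q-value is the minimum
    # of p * m / rank over its own rank and every larger rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Made-up p-values from six hypothetical pairwise comparisons
p_values = [0.01, 0.04, 0.03, 0.20, 0.50, 0.90]
q_values = benjamini_hochberg(p_values)

for p, q in zip(p_values, q_values):
    print(f"p={p:.2f}  q={q:.2f}  significant at 0.05: {q < 0.05}")
```

In this made-up example three comparisons have p < 0.05, but none has q < 0.05 after adjustment, which is the same pattern as in your table.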

terren · June 4, 2020, 4:33am

Thanks. I have edited the question. Could you help me?

Mehrbod_Estaki · June 4, 2020, 5:26am

Hi @terren,
In the future, please don't edit previous posts to add new questions; it's a lot easier to follow a thread when all the replies are in chronological order.

It looks like you already asked this question on another thread and had an excellent response there. Please re-read the answer there from @ChrisKeefe.

There is a great thread here that should help with interpreting beta group-significance boxplots.

I believe I already answered this above as well. Based on the adjusted p-values (labelled q-value), you fail to reject the null hypothesis, which means you are unable to detect differences between your groups.

terren · June 4, 2020, 6:00pm

Thank you very much. I am sorry about the edit. When I posted my question, I realized I had not asked it clearly, so I edited it. After I edited and posted, I saw that you had already replied. I am sorry to trouble you.

For bray_curtis_distance, weighted_unifrac_distance, and unweighted_unifrac_distance, does a larger distance value mean more dissimilarity or more similarity? I am still confused.

Mehrbod_Estaki · June 4, 2020, 8:14pm

Hi @terren,
No problem.

You are looking at 3 different distance matrices, which have different biological interpretations. Chris gave you links to the definition of those matrices in another post.
And I already provided you with a link that explains what the boxplots mean exactly. Finally, we discussed that according to your q-values, your groups are NOT different from each other. So I'm not really sure which part you are referring to when you say you are still confused. Can you be more specific, please?

terren · June 4, 2020, 10:37pm

Thanks for your reply. Take the following graphs as an example. I know there is no difference between the groups, but I cannot understand the values on the vertical axis of each box. With n=15 in the lower left graph, what do the values 0.2 and 1.8 between PF and PF mean?

Mehrbod_Estaki · June 5, 2020, 9:23am

Hi @terren,
Ok, take a look at the diagram below I made (values are just made up)

What beta-group-significance is doing:

calculate within group distances
In our case look at the green circles and the table beside it
There are 3 circles (samples) so there are 3 distances between them. Same within the red group.
But blue group has 4 samples so there are 6 distances within that group (no table shown).
Calculate across group distances
Here look at the distances between the red samples to the green samples.
You can see that there are 9 distances there (dashed lines), and those lines are longer than the within-group distances because the samples are further apart.
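The counting in the two steps above can be sketched as follows. This is only an illustration: the coordinates are made up (like the values in the diagram), plain Euclidean distance stands in for a real beta diversity metric, and the helper names are hypothetical.

```python
from itertools import combinations, product
from math import dist

# Made-up 2D positions for the samples in each group
groups = {
    "green": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    "red": [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)],
    "blue": [(-5.0, 5.0), (-6.0, 5.0), (-5.0, 6.0), (-6.0, 6.0)],
}


def within_group(points):
    """Distances between every unordered pair of samples in one group."""
    return [dist(p, q) for p, q in combinations(points, 2)]


def between_groups(points_a, points_b):
    """Distances from every sample in one group to every sample in the other."""
    return [dist(p, q) for p, q in product(points_a, points_b)]


green_green = within_group(groups["green"])                  # 3 samples
blue_blue = within_group(groups["blue"])                     # 4 samples
red_green = between_groups(groups["red"], groups["green"])   # 3 x 3 samples

print(len(green_green), len(blue_blue), len(red_green))  # 3 6 9
print(max(green_green) < min(red_green))  # True
```

Each box in the boxplot summarizes one such list of distances, and the n under the box is the length of that list.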

Now, the null hypothesis for beta-group-significance (default PERMANOVA) is that the distances within a group (say Green here) are similar to the distances from that group to another group (to Red in our example).

Let's visualize this in the boxplot:
This boxplot is 1 of 3 we would need to make for this example; it shows only the distances to the Green group. Green-to-green distances are relatively short (n=3, look at the 3 green lines). Red-to-green distances are larger because those samples are further away (n=9, look at the 9 dashed lines). And finally, blue-to-green distances look about the same as the red-to-green distances. So our stats (not shown here) would probably suggest that both the red and blue groups have significantly larger distances to green than the within-green distances.

In your example, all the distances look about similar, suggesting there are no group clustering/differences.

Hope this clarifies things for you.

terren · June 5, 2020, 8:42pm

I got it. I really appreciate your help.