Kruskal Wallis (all groups) is non-significant but Kruskal Wallis (pairwise) shows some significance between groups

Hello, I've read through the discussions regarding the differences between the Kruskal-Wallis (all groups) and Kruskal-Wallis (pairwise) statistical outputs in QIIME. I'm looking at my Faith's phylogenetic diversity output, and the KW "all groups" output indicates no significant differences in alpha diversity across all groups; however, the "pairwise" output indicates significant differences between one group and two of the others. How do I interpret this? Is it appropriate to investigate pairwise contrasts when the all-groups analysis yields no significant differences?
Here are my results:

Hi @ShawneeK,

Good question. There isn't a perfectly clear-cut answer to this, but it's worth making a few points:

  1. If you are doing exploratory work, then to control false positives, it's best to look at the global/omnibus test and ignore the pairwise results when the global test is not significant. This is the general omnibus + post-hoc ritual, and unless you have a good reason to ignore it, it's probably not worth fighting your reviewers over it.

  2. Your pairwise tests are only significant when ignoring multiple testing. If you look at the FDR-corrected p-values (the q-values), you'll notice those aren't significant (at least not at the usual alpha of 0.05). This suggests that your pairwise tests actually agree with the global test once you factor that in.

  3. Supposing you had reason a priori to expect a difference in two groups, and did not care about the others (say your main hypothesis was AMF vs Control), then you could use the p-value outright for exactly that one test and ignore every other number (even though BF vs AMF looks alright as well). But you would truly have had to already expect this to be the case before you had looked at any results whatsoever. Otherwise, the p-value doesn't mean what you want it to mean and reporting it would be actively unhelpful.

  4. The Kruskal-Wallis test considers the variation of all groups and tries to identify whether at least one group does not fit the global variation of all groups. This is different from the pairwise comparisons, which do not consider the other groups' variation (at least in our results; other methods can be slightly cleverer than this). In other words, each of these groups is not significantly (p < 0.05) distinct from a random sample of a larger population. But you can always find two random samples of a larger population which happen to be extreme relative to each other. This is why we use FDR correction in the first place, since that ought to occur at a predictable (and correctable) rate if they were in fact the same population.
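
To make point 2 concrete, here is a minimal sketch of how FDR-corrected q-values relate to raw p-values. It assumes the Benjamini-Hochberg procedure, and the p-values are made up for illustration (they are not the results from this thread). The function name `benjamini_hochberg` is just a label chosen for this example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each q-value is capped
    # by the q-values of all larger p-values (keeps q monotone in p).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical raw p-values for six pairwise comparisons (illustration only).
raw_p = [0.03, 0.04, 0.20, 0.45, 0.60, 0.85]
q = benjamini_hochberg(raw_p)

for p, adj in zip(raw_p, q):
    print(f"p = {p:.2f}   q = {adj:.3f}")
```

In this made-up example, two raw p-values fall below 0.05 but none of the q-values do, which is the same pattern described above. The q-values are read against the same threshold as the p-values (for example 0.05); they are simply never smaller than the raw p-values they came from.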

Here's a really good discussion I was able to find on CrossValidated:

Personal soapbox:

I feel like statistics works best when used to demonstrate that a visually obvious pattern is legitimate, rather than using it to detect some non-obvious difference. So I would lead with the plot that shows something interesting, and justify why it is or isn't interesting via some statistic (instead of the other way around).


Thank you so much. I am very new to bioinformatics, so I'm exploring every output and trying to understand how to interpret them properly. This very much helps to clarify. I wasn't sure if the corrected p-values (q) were on the same scale as the p-values (alpha 0.05).

I appreciate your help!
