Does it make sense to use weighted UniFrac distances for 16S data?

Kevin_Rey1 · December 5, 2019, 11:42pm

This may be a very basic theory question, but I'm having difficulty finding any sort of conclusive information online / in literature.

In QIIME 2, we can construct a tree using the MAFFT-to-FastTree workflow, and the resulting rooted tree is what we use as input to the weighted UniFrac calculation.

What doesn't (yet) make sense to me is how the phylogeny of 16S sequences can add taxonomic information in and of itself. For example, a 16S ASV might differ from another by only a few bases, so the two would cluster closely in the constructed tree, but the actual taxonomic identities of those two ASVs might be very different.

Any help understanding this or a point to some helpful literature would be greatly appreciated. Thank you so much for your time!

jwdebelius · December 6, 2019, 9:08am

Hi @Kevin_Rey1,

Welcome to the forum! You seem very timely because we've had a lot of discussions about taxonomy, phylogeny, and their relationship recently.

You have hit on one of the fundamental frustrations of modern ecology: phylogeny and taxonomy don't line up. We hope that they're close, but most names are based on morphology-based phylogeny/taxonomy, and modern molecular-based phylogeny shows that sometimes we got it wrong. There's divergent evolution, there are things we're not always sure about, and it's difficult. What's worse, we have this problem on a macroscale as well! I had a great (frustrated) discussion with a plant ecologist about this same problem last week. I'm going to recommend this post, which talks about some of this.

(The part that starts from "We can infer taxonomy..." and ends with snark about chickens and dinosaurs is probably most relevant to you.)

Okay, so, if you have a monophyletic clade, then the relationship between taxonomy and phylogeny should be closer (although not always), and sometimes we shift names to try to bring things up to date with the tree. Have you seen the recent paper about taxonomic updates based on phylogeny? However, the way we mostly do taxonomy in QIIME is to use a naive Bayesian classifier, which is separate from the phylogeny calculation.

I think (maybe) the place where you're confused is that you're making the assumption that we should build our UniFrac distance based on the taxonomy rather than the phylogeny? For UniFrac, we rely on that evolutionary relationship, entirely agnostic to taxonomy. So, it's a calculation based on a nameless tree. I could, theoretically, do UniFrac on Halloween candy bars if I had a phylogeny relating them. (BTW, that blog looks like it has generally awesome posts and I'm bookmarking them for my "to read" list.)

I'll also make the (brief) mention that "weighted" in UniFrac refers to weighting by abundance, rather than weighting by evolutionary distance.
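To make the "nameless tree" and "weighted by abundance" points concrete, here is a minimal pure-Python sketch on a toy tree. The tree, its branch lengths, and the tip labels are all made up for illustration, and this is not how QIIME 2 computes UniFrac internally; it only follows the textbook definitions (unique branch length over observed branch length for unweighted, and the raw, unnormalized abundance-weighted sum for weighted).

```python
# Toy rooted tree, child -> (parent, branch length). All names are made-up labels.
TREE = {
    "A": ("n1", 1.0), "B": ("n1", 1.0),
    "C": ("n2", 1.0), "D": ("n2", 1.0),
    "n1": ("root", 2.0), "n2": ("root", 2.0),
}


def branch_proportions(counts):
    """Fraction of a sample's reads that sit below each branch (keyed by child node)."""
    total = sum(counts.values())
    below = {node: 0.0 for node in TREE}
    for tip, count in counts.items():
        node = tip
        while node in TREE:  # walk from the tip up to the root
            below[node] += count / total
            node = TREE[node][0]
    return below


def unweighted_unifrac(x, y):
    """Branch length unique to one sample / branch length seen in either sample."""
    px, py = branch_proportions(x), branch_proportions(y)
    unique = observed = 0.0
    for node, (_, length) in TREE.items():
        in_x, in_y = px[node] > 0, py[node] > 0
        if in_x or in_y:
            observed += length
        if in_x != in_y:
            unique += length
    return unique / observed


def weighted_unifrac(x, y):
    """Raw (unnormalized) weighted UniFrac: branch length times abundance difference."""
    px, py = branch_proportions(x), branch_proportions(y)
    return sum(length * abs(px[node] - py[node]) for node, (_, length) in TREE.items())


# Swapping a tip for its sister is a smaller change than swapping it for a distant tip.
only_a, only_b, only_c = {"A": 1}, {"B": 1}, {"C": 1}
print(unweighted_unifrac(only_a, only_b))  # sister tips share the n1 branch
print(unweighted_unifrac(only_a, only_c))  # distant tips share no branch

# Same tips present, very different abundances: only the weighted version notices.
mostly_a = {"A": 90, "C": 10}
mostly_c = {"A": 10, "C": 90}
print(unweighted_unifrac(mostly_a, mostly_c))
print(weighted_unifrac(mostly_a, mostly_c))
```

Nothing in the calculation ever looks at a taxonomic name; the tip labels are only keys into the tree.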

Hopefully this helps untangle some of this? I also recommend looking at some of the previous issues that deal with why multiple ASVs/OTUs have the same name, because that might also shed some light for you?

Best,
Justine

Kevin_Rey1 · December 12, 2019, 8:27pm

Yes, thank you, that really does clarify things. I guess there are some caveats about using UniFrac distances with 16S sequences, but I don't really know that using Bray-Curtis dissimilarity would be any better for showing (dis)similarity.

Thank you very much for the thorough explanation!

jwdebelius · December 12, 2019, 8:32pm

Hi @Kevin_Rey1,

Bray-Curtis and UniFrac (weighted or unweighted) tell you different things about your community. I tend to think of them as complementary and a bit descriptive. If I see a difference in Bray-Curtis and not in one of my unweighted metrics, it tells me that I'm looking at something that's mostly driven by the abundant organisms (my current favorite example of this is smoking in the oral cavity). But maybe I see something that changes in an unweighted metric because of a loss of rare taxa... that tells me something about the community as well.
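A small sketch of those two situations, using made-up counts and hypothetical taxon names, with binary Jaccard standing in for "an unweighted metric":

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity on raw counts."""
    taxa = set(x) | set(y)
    diff = sum(abs(x.get(t, 0) - y.get(t, 0)) for t in taxa)
    return diff / (sum(x.values()) + sum(y.values()))


def binary_jaccard(x, y):
    """Jaccard distance on presence/absence only."""
    present_x = {t for t, c in x.items() if c > 0}
    present_y = {t for t, c in y.items() if c > 0}
    return 1 - len(present_x & present_y) / len(present_x | present_y)


# Hypothetical samples; the taxon names and counts are invented for illustration.
baseline = {"dominant": 900, "common": 90, "rare1": 5, "rare2": 5}
shifted = {"dominant": 500, "common": 490, "rare1": 5, "rare2": 5}  # abundant taxa shift
pruned = {"dominant": 900, "common": 90}                            # rare taxa lost

# Abundance shift: Bray-Curtis moves, presence/absence does not.
print(bray_curtis(baseline, shifted), binary_jaccard(baseline, shifted))

# Loss of rare taxa: presence/absence moves, Bray-Curtis barely does.
print(bray_curtis(baseline, pruned), binary_jaccard(baseline, pruned))
```

Neither number is "right"; each one is sensitive to a different kind of change in the same community.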

I like to think about different metrics as different lenses. I can only get a complete picture of my system if I look at it through multiple lenses. Having both a red and a blue lens might mean the difference between understanding a phenomenon and missing something major.

Best,
Justine

Kevin_Rey1 · December 20, 2019, 8:09pm

Thank you for the explanation!

colinbrislawn · December 20, 2019, 8:59pm

Well said!

Here are the weighted/unweighted and phylogenetic/non-phylogenetic metrics I like to use:

name	measure	meaning
Weighted Jaccard / Soergel	weighted, no phylogeny	change in most common microbes
(Binary) Jaccard	unweighted, no phylogeny	change in less common microbes
Weighted UniFrac	weighted, phylogeny	change in overall composition
Unweighted UniFrac	unweighted, phylogeny	change in finer niches

I'm getting off-topic, but you can play this game with Alpha diversity metrics too:

name	measure and meaning
Observed	Number of features observed in that sample
Faith's PD	Amount of Phylogenetic Distance covered by features in that sample
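
The two alpha diversity rows can be sketched the same way. This is a toy illustration with an invented tree and invented counts, not QIIME 2's implementation; it assumes Faith's PD includes the path from each observed tip all the way up to the root.

```python
# Toy rooted tree, child -> (parent, branch length). All names are made-up labels.
TREE = {
    "A": ("n1", 1.0), "B": ("n1", 1.0),
    "C": ("n2", 1.0), "D": ("n2", 1.0),
    "n1": ("root", 2.0), "n2": ("root", 2.0),
}


def observed_features(counts):
    """Number of features with a nonzero count in the sample."""
    return sum(1 for c in counts.values() if c > 0)


def faith_pd(counts):
    """Total length of all branches connecting the observed tips to the root."""
    covered = set()
    for tip, count in counts.items():
        if count > 0:
            node = tip
            while node in TREE:  # collect every branch on the way to the root
                covered.add(node)
                node = TREE[node][0]
    return sum(TREE[node][1] for node in covered)


sisters = {"A": 5, "B": 5}  # two closely related features
spread = {"A": 5, "C": 5}   # two distantly related features

# Same richness, different phylogenetic breadth.
print(observed_features(sisters), faith_pd(sisters))
print(observed_features(spread), faith_pd(spread))
```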

Colin

P.S.

Why Jaccard vs Bray Curtis? ⚔️

Bray-Curtis is a dissimilarity measure (it does not satisfy the triangle inequality, which makes it a semimetric), while Jaccard is a true distance metric. I like distance metrics more, but that choice is up to you!

You can read more about metric vs semimetric here.
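
For a quick feel of what "semimetric" means in practice, here is a tiny made-up example (three invented samples, two invented taxa) where going through an intermediate sample is "shorter" under Bray-Curtis than going directly, which a true metric never allows:

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity on raw counts."""
    taxa = set(x) | set(y)
    diff = sum(abs(x.get(t, 0) - y.get(t, 0)) for t in taxa)
    return diff / (sum(x.values()) + sum(y.values()))


def binary_jaccard(x, y):
    """Jaccard distance on presence/absence only."""
    present_x = {t for t, c in x.items() if c > 0}
    present_y = {t for t, c in y.items() if c > 0}
    return 1 - len(present_x & present_y) / len(present_x | present_y)


# Invented samples: a and b share nothing, c contains both taxa.
a = {"t1": 1}
b = {"t2": 1}
c = {"t1": 1, "t2": 1}

# Triangle inequality: d(a, b) <= d(a, c) + d(c, b)
bc_direct = bray_curtis(a, b)
bc_detour = bray_curtis(a, c) + bray_curtis(c, b)
print(bc_direct, bc_detour, bc_direct <= bc_detour)  # inequality fails here

j_direct = binary_jaccard(a, b)
j_detour = binary_jaccard(a, c) + binary_jaccard(c, b)
print(j_direct, j_detour, j_direct <= j_detour)      # inequality holds here
```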

No one uses the names consistently, lol. 😹

The very popular vegan package (32k citations) treats all metrics as quantitative, even though Jaccard is explicitly a binary metric. And they admit it:

The quantitative version of Jaccard should probably called Ružička index.
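
A rough sketch of the naming tangle, on invented counts. The Ružička distance here is the standard "1 minus sum of minima over sum of maxima" definition; as far as I understand the vegan documentation, its quantitative Jaccard is computed from Bray-Curtis as 2B/(1+B), which works out to the same number. The binary Jaccard on the same data is a different value.

```python
import math


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity on raw counts."""
    taxa = set(x) | set(y)
    diff = sum(abs(x.get(t, 0) - y.get(t, 0)) for t in taxa)
    return diff / (sum(x.values()) + sum(y.values()))


def ruzicka(x, y):
    """Quantitative ("weighted") Jaccard distance: 1 - sum(min) / sum(max)."""
    taxa = set(x) | set(y)
    mins = sum(min(x.get(t, 0), y.get(t, 0)) for t in taxa)
    maxs = sum(max(x.get(t, 0), y.get(t, 0)) for t in taxa)
    return 1 - mins / maxs


def binary_jaccard(x, y):
    """Jaccard distance on presence/absence only."""
    present_x = {t for t, c in x.items() if c > 0}
    present_y = {t for t, c in y.items() if c > 0}
    return 1 - len(present_x & present_y) / len(present_x | present_y)


# Invented counts for two samples.
x = {"t1": 8, "t2": 2, "t3": 0}
y = {"t1": 2, "t2": 2, "t3": 6}

b = bray_curtis(x, y)
from_bray_curtis = 2 * b / (1 + b)

print(ruzicka(x, y), from_bray_curtis)  # the two quantitative routes agree
print(binary_jaccard(x, y))             # presence/absence gives a different answer
print(math.isclose(ruzicka(x, y), from_bray_curtis))
```

So when a paper or a tool says "Jaccard", it is worth checking whether it means the binary or the quantitative version.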