ASVs driving the differences behind distance matrix

Biancabrown · May 18, 2018, 1:19pm

Hi All,

Thanks again for this program. Quick question: How do I extract/identify the ASVs that are driving the differences seen among sample types in a UniFrac distance matrix?

ebolyen · May 18, 2018, 3:23pm

That's a fantastic question @Biancabrown, and to cut to the chase: there's no good answer.

(This is actually something I'm working on right now in the context of alpha diversity, but the idea is the same.)

There are a couple layers to the problem:

Assuming you mean differences among "samples between sample types" and not "samples within a sample type", we could imagine partitioning the distance matrix into just the distances between sample types.

This isn't an NxN matrix anymore, as you have mutually exclusive ID-sets on either axis (as a sample can't belong to two different sample types [I hope]). That means it isn't really a QIIME 2 distance matrix anymore, and there certainly aren't any actions which can produce it, or types to represent it. It's not a bad idea, we just don't have anything for this.
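To make the partitioning idea concrete, here is a minimal sketch in plain Python. It is not a QIIME 2 action; the function name `between_group_block`, the sample IDs, the distances, and the sample-type labels are all made up for illustration. It just pulls out the rectangular block of distances with one sample type on the rows and the other on the columns.

```python
def between_group_block(ids, matrix, groups, group_a, group_b):
    """Return (row_ids, col_ids, block) holding only the distances
    from samples in group_a (rows) to samples in group_b (columns)."""
    index = {sample_id: i for i, sample_id in enumerate(ids)}
    rows = [s for s in ids if groups[s] == group_a]
    cols = [s for s in ids if groups[s] == group_b]
    block = [[matrix[index[r]][index[c]] for c in cols] for r in rows]
    return rows, cols, block


# Toy symmetric distance matrix (hypothetical values, not real data).
ids = ["s1", "s2", "s3", "s4", "s5"]
matrix = [
    [0.0, 0.2, 0.7, 0.8, 0.6],
    [0.2, 0.0, 0.9, 0.5, 0.4],
    [0.7, 0.9, 0.0, 0.3, 0.1],
    [0.8, 0.5, 0.3, 0.0, 0.2],
    [0.6, 0.4, 0.1, 0.2, 0.0],
]
# Hypothetical sample-type metadata: each sample belongs to exactly one type.
groups = {"s1": "gut", "s2": "gut", "s3": "skin", "s4": "skin", "s5": "skin"}

rows, cols, block = between_group_block(ids, matrix, groups, "gut", "skin")
print(rows, cols)
for row in block:
    print(row)
```

The result is a 2x3 block rather than a square matrix, which is why it no longer fits the usual distance-matrix type.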

In any case, supposing there was an easy way to partition, you are still stuck with the beta-diversity calculation which effectively collapses all of your ASVs between the two samples into a tidy little number. This is basically useless if your goal is to talk about ASVs.

To unpack that you would need to effectively "destructure" the UniFrac calculation. One way that comes to mind is to start dropping ASVs one at a time and seeing which ones "dramatically" (for some definition of dramatic) change your UniFrac value.
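A rough sketch of that drop-one idea, under heavy assumptions: the tree is a hard-coded toy, `((A:1,B:1):1,(C:1,D:2):1)`, stored as a list of (branch length, tips below that branch), and the distance is a simplified unweighted UniFrac (unique branch length over observed branch length) written from scratch here, not the implementation QIIME 2 uses. The names `unweighted_unifrac` and `leave_one_out` are hypothetical.

```python
# Toy tree ((A:1,B:1):1,(C:1,D:2):1); each entry is
# (branch length, set of tip ASVs descending from that branch).
BRANCHES = [
    (1.0, {"A"}),
    (1.0, {"B"}),
    (1.0, {"A", "B"}),
    (1.0, {"C"}),
    (2.0, {"D"}),
    (1.0, {"C", "D"}),
]


def unweighted_unifrac(sample_x, sample_y, branches=BRANCHES):
    """Simplified unweighted UniFrac: branch length seen in exactly one
    sample divided by branch length seen in at least one sample."""
    unique = 0.0
    observed = 0.0
    for length, tips in branches:
        in_x = bool(tips & sample_x)
        in_y = bool(tips & sample_y)
        if in_x or in_y:
            observed += length
        if in_x != in_y:
            unique += length
    return unique / observed if observed else 0.0


def leave_one_out(sample_x, sample_y, branches=BRANCHES):
    """Drop each ASV from both samples in turn and record how much
    the distance moves relative to the full calculation."""
    baseline = unweighted_unifrac(sample_x, sample_y, branches)
    deltas = {}
    for asv in sorted(sample_x | sample_y):
        dropped = unweighted_unifrac(sample_x - {asv}, sample_y - {asv}, branches)
        deltas[asv] = dropped - baseline
    return baseline, deltas


# Hypothetical presence/absence of ASVs in two samples.
sample_x = {"A", "B", "C"}
sample_y = {"A", "D"}

baseline, deltas = leave_one_out(sample_x, sample_y)
print(f"baseline: {baseline:.3f}")
for asv, delta in sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"drop {asv}: {delta:+.3f}")
```

On real data you would repeat this over every between-type sample pair and summarize the changes per ASV, which gets expensive quickly.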

It's also quite likely that no particular ASV "dramatically" changes the score and so it's some composite effect. In fact, I'm almost certain this is what you'll see, as UniFrac is sort of "stabilized" by the phylogenetic tree. So any one ASV is unlikely to really change the outcome, unless that ASV is a wildly different outgroup from the rest of the tree.

Another approach might be to calculate the "components" of the UniFrac distance independently and attempt to see which parts of it are the largest. By that I mean really computing UniFrac yourself, but instead of completing the calculation, you could stop at "branch lengths unique to sample-type A", "branch lengths shared between both", and "branch lengths unique to sample-type B". This also isn't ASVs, but it would at least give you a direction to look, e.g. is the behavior of UniFrac here dominated by shared features or differentiated features, and if the latter, from which sample type?
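Here is a small sketch of that "stop before the final division" idea. Again this is a toy: the tree `((A:1,B:1):1,(C:1,D:2):1)` is hard-coded as (branch length, tips below), the sample contents are invented, and pooling all ASVs seen in a sample type into one set is my own simplifying assumption. `unifrac_components` is a hypothetical helper, not an existing function.

```python
# Toy tree ((A:1,B:1):1,(C:1,D:2):1); each entry is
# (branch length, set of tip ASVs descending from that branch).
BRANCHES = [
    (1.0, {"A"}),
    (1.0, {"B"}),
    (1.0, {"A", "B"}),
    (1.0, {"C"}),
    (2.0, {"D"}),
    (1.0, {"C", "D"}),
]


def unifrac_components(tips_a, tips_b, branches=BRANCHES):
    """Split observed branch length into the three pieces that
    unweighted UniFrac would otherwise collapse into one number."""
    components = {"unique_a": 0.0, "shared": 0.0, "unique_b": 0.0}
    for length, tips in branches:
        in_a = bool(tips & tips_a)
        in_b = bool(tips & tips_b)
        if in_a and in_b:
            components["shared"] += length
        elif in_a:
            components["unique_a"] += length
        elif in_b:
            components["unique_b"] += length
    return components


# Hypothetical samples per sample type, pooled into one ASV set per type.
type_a_samples = [{"A", "B"}, {"A", "C"}]
type_b_samples = [{"A", "D"}, {"D"}]
pooled_a = set().union(*type_a_samples)
pooled_b = set().union(*type_b_samples)

components = unifrac_components(pooled_a, pooled_b)
print(components)

# Finishing the calculation recovers the usual ratio.
total = sum(components.values())
distance = (components["unique_a"] + components["unique_b"]) / total
print(f"unweighted UniFrac from components: {distance:.3f}")
```

Comparing `unique_a`, `shared`, and `unique_b` tells you which side of the comparison contributes the most branch length, even though it still doesn't name individual ASVs.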

In summary, there's really no way to do this at the moment, but maybe someday we'll have tools that can pick apart these population summaries so that we can tie them back to ASVs.

Final note:
Does anyone know of a tool capable of doing this? You might save me a whole lot of work if someone's found a good way to do this already.

mortonjt · May 18, 2018, 3:43pm

You could try using biplots -- @yoshiki has done quite a bit of work on this.

yoshiki · May 22, 2018, 10:43pm

In addition to what has already been mentioned, this paper might be of interest to you. Briefly, the algorithm it describes is capable of identifying the features responsible for driving the differences between groups of samples in the context of a UniFrac distance matrix.
