I am trying to develop a tool that can do the following:
Assuming you have an existing databank of 1000 samples ->
-generate the phylogenetic tree
-calculate the UniFrac distance (weighted or unweighted) matrix
-perform PCoA on the distance matrix
Once I have this databank distance matrix calculated, I would like to be able to add new samples to it on demand, allowing me to plot them in the PCoA space. However, this would require me to re-compute the phylogenetic tree every time and therefore also impact my previously computed distances between samples because the tree will change (correct me if I’m wrong about this).
Therefore, the method I have come up with is - for each individual sample in my baseline set of 1000 samples:
-filter it out from the feature table
-filter out its sequences
Then, for each sample pair, I would create a tree (from 2 samples) and calculate the Unifrac distance, manually storing it in some matrix. This would (in theory) allow me to take in any new sample, create its trees with the 1000 baseline samples and then calculate the Unifrac distance and visualize it in my PCoA space, without impacting the overall PCoA calculation.
With 1000 samples, this would give about 1,000,000 trees and take quite a long time and a lot of space due to the individual trees and merged seqs/ feature tables.
Therefore, I am trying to avoid this computationally intensive and potentially illogical approach.
My question is - does the Unifrac distance change between two samples based on the tree that they’re in ? If adding a third sample to my tree for example doesnt change the initial distance between the first 2, then I should be OK to re-compute my tree every time, however if it does, this makes me more inclined towards the pairwise approach.
I have a couple of thoughts here… which may or may not be of use.
First, there’s a pairwise test for unweighted (and weighted) UniFrac distance in scikit-bio which may answer your underlying question better than trying to do it in QIIME. It does require some python programing, but it integrates okay with the Artifact API. I find the scikit-bio implementation is slower, but it may be a solution. …It has the added benefit that you’re potentially storing fewer files on disk, but might be expensive in terms of memory and may be slow, if you run it in serial.
A second thought might be to consider how pruning affects your tree? I think you can do filtering here with the phylogeny plug in. You’re still back to that pairwise calculation, and it may still be expensive.
This sounds interesting, however the UniFrac calculation will vary if the topology or branch lengths relevant to those samples change. For the modern UniFrac algorithms (i.e., the fast ones), it’s much faster to compute UniFrac in bulk over many samples rather than pairwise. For 1k samples, the time to calculate UniFrac even with a large tree should be < 1 minute which will be much less than the time for fragment insertion. It might be easier to consider only occasionally updating the tree (which is expensive) and recomputing UniFrac when that occurs. And when new samples come in, to filter the samples to the feature mass represented in the tree (which is fast), and recompute UniFrac for all pairs (which is fast). As a third alternative, you might be able to disregard tree construction entirely by using an existing comprehensive tree such as the one from the Earth Microbiome Project. Does this make sense?
Hi @jwdebelius,
Thank you for the response ! I am not so much worried about the Unifrac computation aspect as much as the tree aspect. I will look into the scikit bio method for Unifrac though as well so thank you for that.
Hi Daniel,
Thanks for the reply. To clarify on your response:
“the UniFrac calculation will vary if the topology or branch lengths relevant to those samples change”
Just to confirm: If let’s say I have a tree made from 4 samples and I generate my Unifrac distances from this, would those distances change if I add a 5th sample and re-generate the tree + re-compute Unifrac using these new 4+1 samples ? Or would it stay the same between the 4 old samples ?
Basically I am trying to find a way so that the Unifrac distance remains the same between the samples I use as my initial databank, no matter what new samples I add in the future.
For filtering new samples according to the feature mass in the tree - I am worried that this will cause information loss from new samples, especially if the new samples coming in are phylogenetically distinct to my baseline. What do you think ?
Using a comprehensive tree could be interesting. I will look into this.
Thank you for your detailed response and I look forward to hearing more.
Whether the distances would change depends on how the tree is constructed, so there isn’t a yes/no. It should be stable with fragment insertions though.
It would result in information loss but with a sufficiently comprehensive tree the relationships between your samples probably won’t really change much. And the data are already heavily biased by the upstream sample collection procedures and molecular steps which are probably a stronger technical bias than using a fixed tree.
Ok, so here is a workflow Ive thought of, let me know what you think:
(It might not even require scikit bio)
-Create a master tree from my baseline 1000 samples.
-Compute Unifrac matrix from this tree.
-For any new sample that comes in, merge the sequences and feature tables pairwise with the 1k samples, generate 1k merged trees and then perform pairwise Unifrac with the 1000 trees and plugin my distance vector to my PCoA loading vectors.
This means I only have to do ~1000 Unifrac computations and 1000 tree generations per new sample.
@jwdebelius@wasade Just as an update, I validated the above approach on 100 samples. Here are my results:
I created a single phylogenetic tree and a Unifrac Distance Matrix from this tree.
I also created 10,000 phylo trees for each pairwise combination of the 100 samples (100 x 100 = 10,000). For each pairwise combination, I calculated the Unifrac distance and aggregated them into a matrix.
I then compared the two distance matrices Id generated and noticed that there was only 1.8% average +/- 1.4% difference between the pairwise Unifrac and the aggregated Unifrac.
Therefore I think the approach I listed above should work. I want to thank you for your help in guiding my work, have a great day!