I am trying to develop a tool that, given an existing databank of 1000 samples, can do the following:
-generate the phylogenetic tree
-calculate the UniFrac distance (weighted or unweighted) matrix
-perform PCoA on the distance matrix
Once this distance matrix is calculated for the databank, I would like to add new samples on demand and plot them in the same PCoA space. However, this would require re-computing the phylogenetic tree every time, which would also affect my previously computed distances between samples because the tree will change (correct me if I’m wrong about this).
The method I have come up with is as follows. For each individual sample in my baseline set of 1000 samples:
-filter it out from the feature table
-filter out its sequences
Then, for each sample pair, I would create a tree (from the 2 samples) and calculate the UniFrac distance, manually storing it in a matrix. This would (in theory) allow me to take in any new sample, create its trees with each of the 1000 baseline samples, calculate the UniFrac distances, and visualize the sample in my PCoA space without impacting the overall PCoA calculation.
With 1000 samples, this would mean roughly 500,000 trees (1000 × 999 / 2 unique pairs, or about 1,000,000 if every ordered pair is counted), which would take quite a long time and a lot of space due to the individual trees and merged seqs/feature tables.
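As a rough check on the scale, here is a quick count. It assumes one tree per pair of samples. The total depends on whether each pair is counted once (UniFrac is symmetric, so that should be enough) or once per ordering. The variable names are just for illustration.

```python
import math

n_baseline = 1000  # samples in the existing databank

# One tree per unordered pair of distinct samples.
unordered_pairs = math.comb(n_baseline, 2)

# Every ordered pair, including each sample paired with itself.
ordered_pairs = n_baseline ** 2

# Extra trees needed each time one new sample is added
# (one per baseline sample).
trees_per_new_sample = n_baseline

print(unordered_pairs, ordered_pairs, trees_per_new_sample)
```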
Therefore, I am trying to avoid this computationally intensive and potentially illogical approach.
My question is: does the UniFrac distance between two samples change based on the tree that they’re in? If, for example, adding a third sample to my tree doesn’t change the initial distance between the first 2, then I should be OK to re-compute my tree every time. If it does, I am more inclined towards the pairwise approach.
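To make the question concrete, here is a toy sketch of unweighted UniFrac as I understand it: the branch length unique to one of the two samples divided by the branch length observed in either sample. The tree, feature names, and branch lengths are all made up, and a real implementation may differ in details. In this sketch the distance depends only on the two samples and the tree, so a third sample matters only if it changes the tree itself.

```python
# Toy tree stored as child -> (parent, branch length).
# Leaves f1..f4 are features; n1, n2 are internal nodes (all hypothetical).
TREE = {
    "f1": ("n1", 1.0),
    "f2": ("n1", 1.0),
    "f3": ("n2", 1.0),
    "f4": ("n2", 2.0),
    "n1": ("root", 0.5),
    "n2": ("root", 0.5),
}

# Same topology, but pretend re-building the tree with extra sequences
# changed the length of the branch leading to f3 (made-up value).
TREE_REBUILT = dict(TREE)
TREE_REBUILT["f3"] = ("n2", 2.0)


def branches_covered(features, tree):
    """Branches (named by their child node) on the paths from leaves to root."""
    covered = set()
    for node in features:
        while node in tree:  # stop once we reach the root
            covered.add(node)
            node = tree[node][0]
    return covered


def unweighted_unifrac(a, b, tree):
    """Unique branch length / branch length observed in either sample."""
    in_a = branches_covered(a, tree)
    in_b = branches_covered(b, tree)
    unique = sum(tree[n][1] for n in in_a ^ in_b)
    observed = sum(tree[n][1] for n in in_a | in_b)
    return unique / observed


sample_a = {"f1", "f2"}
sample_b = {"f2", "f3"}

d_fixed_tree = unweighted_unifrac(sample_a, sample_b, TREE)
d_rebuilt_tree = unweighted_unifrac(sample_a, sample_b, TREE_REBUILT)
print(d_fixed_tree, d_rebuilt_tree)
```

Whether re-building a real tree actually shifts branch lengths or topology enough to matter is exactly what I am unsure about.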
Let me know what you think!