I have a question about comparing distance metric values between analyses. We are working on a paper about general quality control and sources of technical variance and noise. We have ~335 participants from 3 different studies who each provided both a fecal swab and an OmniGene gut sample from the same bowel movement. We are trying to extract information about how the sample type and transit time (the time between when the sample was collected and when it was placed in the freezer) affect microbial communities. Many of these samples were shipped at ambient conditions, hence the interest in transit time.
However, when looking at the weighted UniFrac distances generally, and at pairwise differences between sample types, they seem high relative to previous analyses that I have done and to what I have seen in the literature. I am not sure if this is even something that I should be concerned about, because comparing 2 different analyses is generally not apples to apples. Additionally, this set of samples was put together by merging 14 sequencing runs and has 670 total samples. Finally, we are very aware of the facultative anaerobes that can "bloom" during shipping of unstabilized samples (Amir et al. (2017) and others).
Anyway, here are the pairwise differences grouped by study for paired swab-omni samples. These median distances seem high, since these are technically the same sample, just with different sample collection methods and varying transit time (0-12 days). I greatly appreciate your feedback in advance!
I am very excited for this paper!
There are different ways to calculate UniFrac distances. You have already mentioned Weighted distances (instead of Unweighted), which incorporate the abundance of microbes in the calculation.
Some methods of UniFrac directly report shared branch length, and others report the fraction of shared branch length. If the total branch length of the tree is 1.0, these numbers will be identical. Otherwise, these numbers will be different, even if they are conceptually equivalent.
As a toy example using just two samples and a small tree:
Total branch length inside the tree: 3.0
Shared branch length of two samples: 1.0
UniFrac = (3.0-1.0) = 2.0 (raw)
UniFrac = (3.0-1.0)/3.0 = 0.67 (fraction)
Rescaled tree to a total branch length of 1.0:
Total branch length inside the tree: 1.0
Shared branch length of two samples: 0.33
UniFrac = (1.0-0.33) = 0.67 (raw)
UniFrac = (1.0-0.33)/1.0 = 0.67 (fraction)
See how rescaling the total tree branch length changes the raw UniFrac values?
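The toy arithmetic above can be sketched in a few lines of Python. This only mirrors the simplified example (unshared branch length, raw vs. divided by the total); it is not how any particular UniFrac implementation computes its values, and the function names are made up for illustration.

```python
def unifrac_raw(total, shared):
    """Unshared branch length, in whatever units the tree uses."""
    return total - shared


def unifrac_fraction(total, shared):
    """Unshared branch length as a fraction of the total branch length."""
    return (total - shared) / total


def rescale(total, shared, new_total=1.0):
    """Rescale every branch by the same factor so the tree sums to new_total."""
    factor = new_total / total
    return total * factor, shared * factor


# Toy tree from the example: total branch length 3.0, shared 1.0
total, shared = 3.0, 1.0
raw_before = unifrac_raw(total, shared)
frac_before = unifrac_fraction(total, shared)

# Same tree, rescaled to a total branch length of 1.0
total_r, shared_r = rescale(total, shared)
raw_after = unifrac_raw(total_r, shared_r)
frac_after = unifrac_fraction(total_r, shared_r)

print(f"original tree: raw={raw_before:.2f}  fraction={frac_before:.2f}")
print(f"rescaled tree: raw={raw_after:.2f}  fraction={frac_after:.2f}")
```

The raw value moves with the tree's scale, while the fraction stays put.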
Thank you for this example. Very helpful!
In summary, one should not be too concerned about the distances relative to other analyses, because the comparison is apples to oranges.
Thanks again for the quick and thorough response!
A couple more questions...
Our processing pipeline denoises with Deblur, and we use SEPP to generate our tree and rep_seqs. Does this affect the scaling/rescaling of the tree (sorry if that question is unintelligible - I am a little beyond my understanding at this point)?
Also how does increasing the number of merged runs and/or total number of samples influence the tree and distances?
Or rather it's inches vs centimeters; they are the same!
(but here, people are not saying what units they are using)
(and I thought everyone knew which one to use, but I'm always surprised!)
Statistical comparisons should still be valid, as it's all just a scaling factor.
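As a rough illustration of why a constant scaling factor should not change the statistics, here is a small sketch. The distances and the scale factor are made-up numbers, not values from this dataset, and the hand-rolled Mann-Whitney U is just one example of a rank-based statistic.

```python
import statistics


def mann_whitney_u(a, b):
    """U statistic for group a: pairs with a > b count 1, ties count 0.5."""
    return sum((x > y) + 0.5 * (x == y) for x in a for y in b)


# Hypothetical distances for two groups (illustration only)
group_a = [0.21, 0.35, 0.18, 0.40]
group_b = [0.52, 0.47, 0.61, 0.38]

# Hypothetical scaling factor, e.g. a different total branch length
scale = 3.0
group_a_scaled = [d * scale for d in group_a]
group_b_scaled = [d * scale for d in group_b]

# Rank-based statistic: depends only on the ordering of the distances
u_original = mann_whitney_u(group_a, group_b)
u_scaled = mann_whitney_u(group_a_scaled, group_b_scaled)

# Ratio of medians: the scale factor cancels out
ratio_original = statistics.median(group_a) / statistics.median(group_b)
ratio_scaled = statistics.median(group_a_scaled) / statistics.median(group_b_scaled)

print("U statistic:", u_original, "vs", u_scaled)
print("median ratio:", round(ratio_original, 4), "vs", round(ratio_scaled, 4))
```

Anything that depends only on ranks or on ratios of distances behaves the same way; the absolute values are the only thing that shifts.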
This is a great question! I'm not sure how SEPP works...
This can depend on whether the tree is being built, rebuilt, or modified. I'm not sure how SEPP does it, so we should investigate.
Unless the tree or calculation normalizes the total branch length to 1.
Thanks again for the speedy response. Just wanted to close the loop on this and show the scaling comparison for future users that run SEPP and are concerned about their distance values.
I remade the tree using align-to-tree-mafft-fasttree and, like you suggested, there was a scaling difference, but the stats were generally the same. Here is the scaling comparison from my first post but with the updated tree.
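For anyone who wants to put a number on the scaling difference between two trees, one possible approach is sketched below. It is not necessarily what was done for the comparison above: the paired distances are invented placeholders, and the helper names are hypothetical. The idea is to fit a single factor k so that distance_tree_2 ≈ k * distance_tree_1, then check how well one factor explains the difference.

```python
import math


def fit_scale(x, y):
    """Least-squares slope k for y ~ k * x (line through the origin)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical distances for the same sample pairs under two trees
# (placeholders only, not real results)
dist_tree_1 = [0.12, 0.30, 0.25, 0.41, 0.18]
dist_tree_2 = [0.25, 0.59, 0.51, 0.80, 0.37]

k = fit_scale(dist_tree_1, dist_tree_2)
r = pearson(dist_tree_1, dist_tree_2)

print(f"estimated scaling factor: {k:.2f}")
print(f"correlation between the two sets of distances: {r:.3f}")
```

A correlation close to 1 suggests the two trees mostly differ by a constant factor; a lower value would hint that the trees differ in more than scale.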