Short version: with 4-5 different pieces of software, I got nearly as many different UniFrac values out (though most are strongly correlated; phyloseq's weighted UniFrac is the exception, due to a probable bug):
```
# Correlation matrix for weighted UniFrac
             qiime  phyloseq     rbiom  gunifrac
qiime            1 0.3009552 1.0000000 0.9757110
phyloseq        NA 1.0000000 0.3009552 0.4489365
rbiom           NA        NA 1.0000000 0.9757110
gunifrac        NA        NA        NA 1.0000000

# Correlation matrix for unweighted UniFrac
             qiime  phyloseq     rbiom  gunifrac   picante
qiime            1 0.9791214 0.9998755 0.9998755 1.0000000
phyloseq        NA 1.0000000 0.9799686 0.9799686 0.9791214
rbiom           NA        NA 1.0000000 1.0000000 0.9998755
gunifrac        NA        NA        NA 1.0000000 0.9998755
picante         NA        NA        NA        NA 1.0000000
```
This worries me because, as I understand it, UniFrac is supposed to be completely deterministic. The original issue says that QIIME benchmarks its implementation via scikit-bio's unit tests, so for now I'm trusting it, but the variability is still concerning. Is there some other independent dataset out there that can be benchmarked against (ideally, something as complex as real data) to know which implementation is truly correct?
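For what it's worth, the unweighted metric itself is a simple deterministic ratio of branch lengths, which is why disagreement between implementations is so surprising. Here's a minimal sketch on an invented toy tree (the branch table and taxon names are mine, not taken from any of the packages above):

```python
# Each branch of a toy rooted tree: (branch length, set of taxa below it).
# All values here are invented purely for illustration.
branches = [
    (0.5, {"taxA"}),
    (0.5, {"taxB"}),
    (1.0, {"taxA", "taxB"}),
    (0.7, {"taxC"}),
]

def unweighted_unifrac(sample1, sample2):
    """Fraction of observed branch length unique to one of the two samples."""
    unique = shared = 0.0
    for length, taxa in branches:
        in1 = bool(taxa & sample1)  # branch leads to a taxon present in sample 1
        in2 = bool(taxa & sample2)
        if in1 and in2:
            shared += length
        elif in1 or in2:
            unique += length
    return unique / (unique + shared)

print(unweighted_unifrac({"taxA"}, {"taxB", "taxC"}))  # 1.7 / 2.7
```

There is no randomness anywhere in this computation, so given the same rooted tree and the same presence/absence table, every correct implementation should agree to floating-point precision.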
Update: I loaded the skbio benchmark data into R and checked it. Phyloseq failed to run (see above re: bug), but the others all work perfectly on the benchmark data:
```
# Correlation matrix on skbio's benchmark weighted UniFrac
          benchmark rbiom gunifrac
benchmark         1     1        1
rbiom            NA     1        1
gunifrac         NA    NA        1

# Correlation matrix on skbio's benchmark unweighted UniFrac
          benchmark rbiom gunifrac picante
benchmark         1     1        1       1
rbiom            NA     1        1       1
gunifrac         NA    NA        1       1
picante          NA    NA       NA       1
```
This is what concerns me: the benchmark data apparently lacks some feature of the Moving Pictures dataset that causes the UniFrac calculations to differ. I could dismiss the ones that are 99.9% similar as rounding/floating-point precision issues, but the ones at 98% or lower seem like too big a difference for that to be true.
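One way to distinguish rounding noise from a real algorithmic difference is to compare the matrices element-wise rather than by correlation: a systematic difference (e.g. a constant rescaling) can still correlate perfectly. A small sketch with invented numbers:

```python
import numpy as np

# Toy "distance matrices" from two hypothetical implementations; values invented.
d1 = np.array([[0.00, 0.30, 0.50],
               [0.30, 0.00, 0.42],
               [0.50, 0.42, 0.00]])
d2 = d1 + 1e-12   # pure floating-point noise
d3 = d1 * 1.1     # a systematic difference, e.g. a different normalization

def lower_tri(m):
    """Extract the strictly-lower-triangle entries (the unique distances)."""
    return m[np.tril_indices_from(m, k=-1)]

for name, other in [("noise", d2), ("scaled", d3)]:
    r = np.corrcoef(lower_tri(d1), lower_tri(other))[0, 1]
    same = np.allclose(d1, other, atol=1e-8)
    print(name, round(r, 6), same)
```

Here the rescaled matrix correlates at essentially 1.0 yet fails the element-wise check, so a high correlation alone can't rule out a systematic difference between implementations.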
Is it the case that the original PyCogent implementation is considered “the” correct implementation that I should check against (using a rooted tree to make it comparable), or is one of the later implementations considered more correct? (I realize that even a 97% correlation will probably result in the same practical interpretation; I’m just concerned about so many different packages yielding different results for what should, in theory, have only a single answer.)
Update: I realized I hadn't rarefied the Moving Pictures data before running the different UniFrac implementations. Using the rarefied_table.qza artifact leads to unweighted UniFrac matching across all non-phyloseq software, but weighted is still different for GUniFrac, so I assume the weighting is going wrong somewhere?
```
# Weighted, rarefied
             qiime  phyloseq     rbiom  gunifrac
qiime            1 0.5098069 1.0000000 0.9906098
phyloseq        NA 1.0000000 0.5098069 0.5729023
rbiom           NA        NA 1.0000000 0.9906098
gunifrac        NA        NA        NA 1.0000000

# Unweighted, rarefied
            qiime phyloseq    rbiom gunifrac  picante
qiime           1 0.975154 1.000000 1.000000 1.000000
phyloseq       NA 1.000000 0.975154 0.975154 0.975154
rbiom          NA       NA 1.000000 1.000000 1.000000
gunifrac       NA       NA       NA 1.000000 1.000000
picante        NA       NA       NA       NA 1.000000
```
This at least is close enough to alleviate some of my concerns, although I'm still worried about why the non-rarefied data comes out so differently. Hopefully rarefaction removes most of whatever the problem is.
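One plausible culprit for the remaining weighted discrepancy is normalization: weighted UniFrac exists in both a raw and a normalized variant, and if two packages default to different variants, their matrices will be strongly but not perfectly correlated (the normalizing denominator varies per sample pair). A toy sketch of both variants, assuming the standard definitions; the tree, counts, and helper names are invented, and each package's actual default should be checked in its documentation:

```python
# Toy rooted tree: (branch length, set of taxa below the branch). Invented values.
branches = [
    (0.5, {"taxA"}),
    (0.5, {"taxB"}),
    (1.0, {"taxA", "taxB"}),
    (0.7, {"taxC"}),
]

def weighted_unifrac(counts1, counts2, normalized=False):
    """Raw weighted UniFrac, or the normalized variant if normalized=True."""
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    num = den = 0.0
    for length, taxa in branches:
        p1 = sum(counts1.get(t, 0) for t in taxa) / n1  # relative abundance below branch
        p2 = sum(counts2.get(t, 0) for t in taxa) / n2
        num += length * abs(p1 - p2)
        den += length * (p1 + p2)
    return num / den if normalized else num

s1 = {"taxA": 8, "taxB": 2}
s2 = {"taxB": 5, "taxC": 5}
print(weighted_unifrac(s1, s2))                   # raw
print(weighted_unifrac(s1, s2, normalized=True))  # normalized
```

Even with equal sampling depths, the two variants give different numbers, so two correct implementations with different defaults would still produce imperfectly correlated matrices.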
Thanks for posting your UniFrac tests. I'm not 100% sure about what's going on here, but I'm glad that the newest tests show that the QIIME software stack is internally consistent.
Both @wasade and @gregcaporaso know a lot more about the history of these implementations than I do, and hopefully they can chime in. You are right that UniFrac should be deterministic, so even small differences between implementations are worrying.
Thank you for joining the forum, and for taking the time to perform this assessment and post your exploration! UniFrac is deterministic, and variation in results is unexpected.
QIIME 2 uses an implementation of Striped UniFrac. The unit tests for Striped UniFrac are derived from the unit tests for the Fast UniFrac implementation in scikit-bio (which QIIME 2 used previously). These implementations exhibit identical results to the best of our knowledge. The unit tests for Fast UniFrac in scikit-bio are derived from the unit tests associated with the reference Fast UniFrac implementation in PyCogent, which were developed by the original authors of UniFrac, and which also exhibit identical results to the best of our knowledge.
We consider the scikit-bio tests to be definitive, as they are more comprehensive than the original PyCogent ones, enumerate known edge cases, and include numerous hand-done validations. The UniFrac implementation used by QIIME 2 appears consistent with implementations spanning over 10 years of use, and with the reference implementations developed by the original authors of UniFrac and Fast UniFrac.