Short version: with 4-5 different pieces of software, I got nearly as many different UniFrac values out (though most are strongly correlated; phyloseq's weighted UniFrac is the exception, due to a probable bug):
```
# Correlation matrix for weighted UniFrac
             qiime  phyloseq     rbiom  gunifrac
qiime            1 0.3009552 1.0000000 0.9757110
phyloseq        NA 1.0000000 0.3009552 0.4489365
rbiom           NA        NA 1.0000000 0.9757110
gunifrac        NA        NA        NA 1.0000000

# Correlation matrix for unweighted UniFrac
             qiime  phyloseq     rbiom  gunifrac   picante
qiime            1 0.9791214 0.9998755 0.9998755 1.0000000
phyloseq        NA 1.0000000 0.9799686 0.9799686 0.9791214
rbiom           NA        NA 1.0000000 1.0000000 0.9998755
gunifrac        NA        NA        NA 1.0000000 0.9998755
picante         NA        NA        NA        NA 1.0000000
```
This worries me because, as I understand it, UniFrac is supposed to be completely deterministic. The original issue says that QIIME benchmarks its implementation via scikit-bio's unit tests, so for now I'm trusting it, but the variability is still concerning. Is there some other independent dataset out there that can be benchmarked against (ideally, something as complex as real data) to know which implementation is truly correct?
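For what it's worth, the unweighted metric itself is a simple deterministic ratio of branch lengths, which is why disagreement between implementations is so surprising. Here's a minimal sketch on an invented toy tree (the branch table and taxon names are mine, not taken from any of the packages above):

```python
# Each branch of a toy rooted tree: (branch length, set of taxa below it).
# All values here are invented purely for illustration.
branches = [
    (0.5, {"taxA"}),
    (0.5, {"taxB"}),
    (1.0, {"taxA", "taxB"}),
    (0.7, {"taxC"}),
]

def unweighted_unifrac(sample1, sample2):
    """Fraction of observed branch length unique to one of the two samples."""
    unique = shared = 0.0
    for length, taxa in branches:
        in1 = bool(taxa & sample1)  # branch leads to a taxon present in sample 1
        in2 = bool(taxa & sample2)
        if in1 and in2:
            shared += length
        elif in1 or in2:
            unique += length
    return unique / (unique + shared)

print(unweighted_unifrac({"taxA"}, {"taxB", "taxC"}))  # 1.7 / 2.7
```

There is no randomness anywhere in this computation, so given the same rooted tree and the same presence/absence table, every correct implementation should agree to floating-point precision.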
Update: I loaded the skbio benchmark data into R and checked it. Phyloseq failed to run (see above re: bug), but the others all work perfectly on the benchmark data:
```
# Correlation matrix on skbio's benchmark weighted UniFrac
          benchmark rbiom gunifrac
benchmark         1     1        1
rbiom            NA     1        1
gunifrac         NA    NA        1

# Correlation matrix on skbio's benchmark unweighted UniFrac
          benchmark rbiom gunifrac picante
benchmark         1     1        1       1
rbiom            NA     1        1       1
gunifrac         NA    NA        1       1
picante          NA    NA       NA       1
```
This is what concerns me: the benchmark data apparently lacks some feature of the Moving Pictures dataset that causes the UniFrac calculations to differ. I could dismiss the ones that are 99.9% similar as rounding/floating-point precision issues, but the ones at 98% or lower seem like too big a difference for that to be true.
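One way to distinguish rounding noise from a real algorithmic difference is to compare the matrices element-wise rather than by correlation: a systematic difference (e.g. a constant rescaling) can still correlate perfectly. A small sketch with invented numbers:

```python
import numpy as np

# Toy "distance matrices" from two hypothetical implementations; values invented.
d1 = np.array([[0.00, 0.30, 0.50],
               [0.30, 0.00, 0.42],
               [0.50, 0.42, 0.00]])
d2 = d1 + 1e-12   # pure floating-point noise
d3 = d1 * 1.1     # a systematic difference, e.g. a different normalization

def lower_tri(m):
    """Extract the strictly-lower-triangle entries (the unique distances)."""
    return m[np.tril_indices_from(m, k=-1)]

for name, other in [("noise", d2), ("scaled", d3)]:
    r = np.corrcoef(lower_tri(d1), lower_tri(other))[0, 1]
    same = np.allclose(d1, other, atol=1e-8)
    print(name, round(r, 6), same)
```

Here the rescaled matrix correlates at essentially 1.0 yet fails the element-wise check, so a high correlation alone can't rule out a systematic difference between implementations.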
Is it the case that the original PyCogent implementation is considered “the” correct implementation that I should check against (using a rooted tree to make it comparable), or is one of the later implementations considered more correct? (I realize that even a 97% correlation will probably result in the same practical interpretation; I’m just concerned about so many different packages yielding different results for what should, in theory, have only a single answer.)
Update: I realized I hadn't rarefied the Moving Pictures data before running the different UniFrac implementations. Using the rarefied_table.qza artifact leads to unweighted UniFrac matching across all non-phyloseq software, but weighted is still different for GUniFrac, so I assume the weighting is going wrong somewhere?
```
# Weighted, rarefied
             qiime  phyloseq     rbiom  gunifrac
qiime            1 0.5098069 1.0000000 0.9906098
phyloseq        NA 1.0000000 0.5098069 0.5729023
rbiom           NA        NA 1.0000000 0.9906098
gunifrac        NA        NA        NA 1.0000000

# Unweighted, rarefied
            qiime phyloseq    rbiom gunifrac  picante
qiime           1 0.975154 1.000000 1.000000 1.000000
phyloseq       NA 1.000000 0.975154 0.975154 0.975154
rbiom          NA       NA 1.000000 1.000000 1.000000
gunifrac       NA       NA       NA 1.000000 1.000000
picante        NA       NA       NA       NA 1.000000
```
This at least is close enough to alleviate some of my concerns, although I'm still worried about why the non-rarefied data comes out so differently. Hopefully rarefaction removes most of whatever the problem is.
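One plausible culprit for the remaining weighted discrepancy is normalization: weighted UniFrac exists in both a raw and a normalized variant, and if two packages default to different variants, their matrices will be strongly but not perfectly correlated (the normalizing denominator varies per sample pair). A toy sketch of both variants, assuming the standard definitions; the tree, counts, and helper names are invented, and each package's actual default should be checked in its documentation:

```python
# Toy rooted tree: (branch length, set of taxa below the branch). Invented values.
branches = [
    (0.5, {"taxA"}),
    (0.5, {"taxB"}),
    (1.0, {"taxA", "taxB"}),
    (0.7, {"taxC"}),
]

def weighted_unifrac(counts1, counts2, normalized=False):
    """Raw weighted UniFrac, or the normalized variant if normalized=True."""
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    num = den = 0.0
    for length, taxa in branches:
        p1 = sum(counts1.get(t, 0) for t in taxa) / n1  # relative abundance below branch
        p2 = sum(counts2.get(t, 0) for t in taxa) / n2
        num += length * abs(p1 - p2)
        den += length * (p1 + p2)
    return num / den if normalized else num

s1 = {"taxA": 8, "taxB": 2}
s2 = {"taxB": 5, "taxC": 5}
print(weighted_unifrac(s1, s2))                   # raw
print(weighted_unifrac(s1, s2, normalized=True))  # normalized
```

Even with equal sampling depths, the two variants give different numbers, so two correct implementations with different defaults would still produce imperfectly correlated matrices.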
Thanks for posting your UniFrac tests. I'm not 100% sure about what's going on here, but I'm glad that the newest tests show that the QIIME software stack is internally consistent.
Both @wasade and @gregcaporaso know a lot more about the history of these implementations than I do, and hopefully they can chime in. You are right that UniFrac should be deterministic, so even small differences between implementations are worrying.
Thank you for joining the forum, and for taking the time to perform this assessment and post your exploration! UniFrac is deterministic, and variation in results is unexpected.
QIIME 2 uses an implementation of Striped UniFrac. The unit tests for Striped UniFrac are derived from the unit tests for the Fast UniFrac implementation in scikit-bio (which QIIME 2 used previously). These implementations exhibit identical results to the best of our knowledge. The unit tests for Fast UniFrac in scikit-bio are derived from the unit tests associated with the reference Fast UniFrac implementation in PyCogent, which were developed by the original authors of UniFrac, and which also exhibit identical results to the best of our knowledge.
We consider the scikit-bio tests to be definitive, as they are more comprehensive than the original PyCogent ones, enumerate known edge cases, and include numerous hand-done validations. The UniFrac implementation used by QIIME 2 appears consistent with implementations spanning over 10 years of use, and with the reference implementations developed by the original authors of UniFrac and Fast UniFrac.