Very different weighted unifrac results for qiime2 versus phyloseq

nick-youngblut · June 16, 2018, 7:00am

A couple of members of my lab and I have been getting very different results when using qiime2 versus phyloseq for calculating weighted unifrac values. It seems that the default parameters used by these tools (qiime diversity core-metrics-phylogenetic versus phyloseq::distance) generate very different weighted unifrac values, even when pre-rarefying the dataset (since qiime2 automatically rarefies). Maybe it's just something with the defaults that's causing the difference, but I can't tell what that is. I've tried phyloseq::UniFrac(weighted=TRUE, normalized=FALSE), but the resulting values are still different from qiime2. Maybe it's how phyloseq deals with the root, but again, I'm not sure why. I've been using rooted phylogenies, so the same root should be used for phyloseq and qiime2. I've attached a compressed Jupyter notebook file showing a reproducible example with the GlobalPatterns dataset from phyloseq. The version of qiime2 is a bit old, but I'm guessing the weighted unifrac algorithm hasn't changed in more recent versions.
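
For readers unfamiliar with what the normalized flag changes, here is a minimal Python sketch of the published weighted UniFrac formula (Lozupone et al. 2007) on a toy tree. It is not the code of either qiime2 or phyloseq; the tree, the sample counts, and all function names are hypothetical and chosen only for illustration. The raw value is the branch-length-weighted sum of differences in the proportion of reads descending from each branch; the normalized value divides that by the abundance-weighted average root-to-tip distance.

```python
# Toy rooted tree (hypothetical): node -> (parent, length of branch to parent).
TREE = {
    "n1": ("root", 1.0),
    "t3": ("root", 3.0),
    "t1": ("n1", 1.0),
    "t2": ("n1", 2.0),
}


def branches_to_root(tip, tree):
    """List the branches (named by their child node) from a tip up to the root."""
    path, node = [], tip
    while node in tree:
        path.append(node)
        node = tree[node][0]
    return path


def branch_proportions(counts, tree):
    """Fraction of a sample's reads that descend from each branch."""
    total = sum(counts.values())
    props = {node: 0.0 for node in tree}
    for tip, n in counts.items():
        for node in branches_to_root(tip, tree):
            props[node] += n / total
    return props


def weighted_unifrac(a, b, tree, normalized=False):
    """Weighted UniFrac between two samples given as {tip: count} dicts."""
    pa, pb = branch_proportions(a, tree), branch_proportions(b, tree)
    raw = sum(length * abs(pa[node] - pb[node])
              for node, (_, length) in tree.items())
    if not normalized:
        return raw
    # Normalization factor: root-to-tip distance of each tip, weighted by
    # that tip's relative abundance in both samples.
    total_a, total_b = sum(a.values()), sum(b.values())
    scale = 0.0
    for tip in set(a) | set(b):
        depth = sum(tree[node][1] for node in branches_to_root(tip, tree))
        scale += depth * (a.get(tip, 0) / total_a + b.get(tip, 0) / total_b)
    return raw / scale


# Made-up counts for two samples.
sample_a = {"t1": 5, "t2": 5}
sample_b = {"t2": 5, "t3": 5}
raw_value = weighted_unifrac(sample_a, sample_b, TREE)
norm_value = weighted_unifrac(sample_a, sample_b, TREE, normalized=True)
```

On this toy input the raw value is 2.5 and the normalized value is roughly 0.45, so the two variants sit on different scales even for identical data. Whether a scaling difference of this kind accounts for the discrepancy reported here is not something the sketch can answer.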

I've also attached a figure showing histograms of beta-diversity values calculated on GlobalPatterns (rarefied to min of any sample) with qiime diversity core-metrics-phylogenetic versus phyloseq::distance.

I think it would be good to make sure researchers know that they will get very different weighted unifrac values depending on whether they use the default methods for qiime2 versus phyloseq. Moreover, I'm not sure which method is "correct". From processing my own datasets with both methods, it appears that qiime2 generates much more reasonable values than phyloseq.

qiime2-vs-phyloseq.html.zip (105.9 KB)

ebolyen · June 18, 2018, 4:53pm

Thanks for the notebook, that is wonderful! I haven't looked too closely yet, but is it possible that we're just seeing the result of rarefying twice? Granted, I wouldn't expect weighted unifrac to look that different between different rarefactions, but that's the most obvious place to look initially.

If we were to take the rarefied table from core-metrics-phylogenetic and use that in phyloseq (skipping phyloseq's rarefaction), do the distributions line up again?
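
Beyond comparing histograms, one way to check whether the two outputs line up is to pair the distances sample pair by sample pair. A minimal sketch in Python, assuming each tool's distance matrix has already been loaded into a nested dict keyed by sample ID; the matrices, values, and function names below are hypothetical:

```python
from itertools import combinations
from math import sqrt


def paired_distances(dm_a, dm_b):
    """Collect the distance for every sample pair present in both matrices."""
    shared = sorted(set(dm_a) & set(dm_b))
    pairs = list(combinations(shared, 2))
    xs = [dm_a[s][t] for s, t in pairs]
    ys = [dm_b[s][t] for s, t in pairs]
    return pairs, xs, ys


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up matrix standing in for one tool's output.
dm_a = {
    "s1": {"s1": 0.0, "s2": 0.2, "s3": 0.6},
    "s2": {"s1": 0.2, "s2": 0.0, "s3": 0.5},
    "s3": {"s1": 0.6, "s2": 0.5, "s3": 0.0},
}
# Made-up second matrix: every distance doubled, to mimic a pure scaling offset.
dm_b = {s: {t: 2 * d for t, d in row.items()} for s, row in dm_a.items()}

pairs, xs, ys = paired_distances(dm_a, dm_b)
r = pearson(xs, ys)
max_diff = max(abs(x - y) for x, y in zip(xs, ys))
```

A correlation near 1 together with a large maximum difference would suggest the two tools rank sample pairs the same way but report values on different scales. A low correlation would point to a more fundamental difference in how the distances are computed.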

nick-youngblut · June 18, 2018, 5:07pm

I'm confused by "rarefying twice". If the rarefying depth for qiime diversity core-metrics-phylogenetic is set to the same value as the table that's already rarefied, shouldn't that essentially be the original counts? Rarefying in qiime2 by default is without replacement, correct?

nick-youngblut · June 18, 2018, 5:10pm

Yeah, at least for qiime feature-table rarefy, the docs state:

Subsample frequencies from all samples without replacement so that the sum of frequencies in each sample is equal to sampling-depth.

So, rarefying a table that's already been rarefied to that depth shouldn't change anything, correct?
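
A quick way to convince yourself of this is a small Python sketch of subsampling without replacement. This is not qiime2's implementation; the function name and counts are hypothetical, and raising an error for samples below the depth is a simplification made here.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=None):
    """Draw `depth` observations without replacement from one sample."""
    total = sum(counts.values())
    if depth > total:
        # Simplification for this sketch: fail rather than drop the sample.
        raise ValueError("sampling depth exceeds the sample's total frequency")
    # Expand the counts into one entry per observation, then subsample.
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    drawn = random.Random(seed).sample(pool, depth)
    return dict(Counter(drawn))


# Made-up sample with 8 observations.
sample = {"otu1": 4, "otu2": 3, "otu3": 1}
once = rarefy(sample, 5, seed=1)
# Rarefying again to the same depth draws every remaining observation,
# so the counts cannot change regardless of the seed.
twice = rarefy(once, 5, seed=2)
```

Here `twice == once` holds for any pair of seeds, because drawing 5 observations without replacement from a pool of exactly 5 returns the whole pool.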

ebolyen · June 18, 2018, 5:11pm

I apologize, I missed that your notebook wrote out the rarefied table and used that with QIIME 2. You are correct: rarefying twice at the same depth doesn't change anything, since we rarefy without replacement.

nick-youngblut · June 18, 2018, 5:18pm

No worries! I'm worried about the differences between phyloseq and qiime2, at least for their defaults. Both methods are used by members of my lab, so they may be getting very different results just because they are using phyloseq versus qiime2.

ebolyen · June 18, 2018, 5:21pm

Of course!

Pinging @wasade who's implemented a whole slew of UniFrac varieties. Any idea what variety of UniFrac Phyloseq might be using here?

Nicholas_Bokulich · June 18, 2018, 5:28pm

@wasade @ebolyen @nick-youngblut linking to the phyloseq source code for convenience:

github.com/joey711/phyloseq/blob/master/R/distance-methods.R#L565
      
# Fast UniFrac for R.
# Adapted from The ISME Journal (2010) 4, 17-27; doi:10.1038/ismej.2009.97;
# http://www.nature.com/ismej/journal/v4/n1/full/ismej200997a.html
################################################################################
#' @importFrom ape prop.part
#' @importFrom ape reorder.phylo
#' @importFrom ape node.depth
#' @importFrom ape node.depth.edgelength
#' @keywords internal
#' @import foreach
fastUniFrac <- function(physeq, weighted=FALSE, normalized=TRUE, parallel=FALSE){
	# Access the needed components. Note, will error if missing in physeq.
	OTU  <- otu_table(physeq)
	tree <- phy_tree(physeq)
	# Some important checks.
	if( is.null(tree$edge.length) ) {
	  stop("Tree has no branch lengths, cannot compute UniFrac")
	}
	if( !is.rooted(tree) ) {
	  stop("Rooted phylogeny required for UniFrac calculation")
	}
	# [excerpt ends here; see the linked source for the rest of the function]

wasade · June 20, 2018, 7:56pm

@nick-youngblut, I’m not sure of the source of the difference but thank you for flagging it. Weighted UniFrac is deterministic, and the implementations of UniFrac used by QIIME 2 are validated against the original implementation of UniFrac by Cathy Lozupone from PyCogent (unit tests for the version of UniFrac being used by QIIME 2 by default are here). Has anyone in your team followed up with the phyloseq developers about the difference?

Best,
Daniel

nick-youngblut · June 21, 2018, 6:45am

I posted an issue on the phyloseq github site the same day that I posted to this forum.

wasade · June 21, 2018, 1:16pm

Great, thank you for doing so.

Best,
Daniel
