Is percentile-normalized data appropriate for UniFrac?

Hi Claire,

Sorry, I was following this post and a question popped into my mind.

I was wondering if you could let me know whether I could use percentile-normalized data for metrics such as weighted and unweighted UniFrac. The only normalization method available in QIIME 2 is rarefaction, and I am not sure if it's appropriate for such metrics.

Hi @ptalebic! Unfortunately, I don’t think percentile-normalized data will work for either weighted or unweighted UniFrac.

Unweighted UniFrac uses the presence/absence of OTUs (or ASVs or whatever your features are) to calculate beta diversity. Because percentile normalization can’t handle zeroes very well, the algorithm adds non-zero noise to any OTU with a zero count (see my blog post for more on that). That means that basically all of your values will be non-zero after percentile normalization, so every feature looks "present" in every sample and the unweighted UniFrac calculation will not be meaningful.
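
To make the presence/absence point concrete, here is a minimal sketch in plain Python. It is not the percentile-normalization code itself: the count table is made up, the zero replacement is only an assumed stand-in for the noise step, and Jaccard distance is used as a simple non-phylogenetic stand-in for unweighted UniFrac.

```python
import random

random.seed(0)

# Hypothetical count table: each row is a sample, each column an OTU.
counts = [
    [10, 0, 5, 0],
    [0, 3, 0, 0],
    [7, 0, 0, 2],
]


def presence(row):
    """Presence/absence vector, as an unweighted metric would see it."""
    return [value > 0 for value in row]


def jaccard_distance(a, b):
    """1 - (shared features / features present in either sample)."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - shared / either


# Assumed stand-in for the zero handling: every zero becomes a tiny
# positive random value, so no zeros survive.
noisy = [
    [value if value > 0 else random.uniform(1e-9, 1e-6) for value in row]
    for row in counts
]

d_before = jaccard_distance(presence(counts[0]), presence(counts[1]))
d_after = jaccard_distance(presence(noisy[0]), presence(noisy[1]))

print(d_before)  # samples 0 and 1 share no OTUs in the raw counts
print(d_after)   # after the noise step every OTU is "present" in both
```

In this toy table the two samples share no OTUs before the noise step and look identical afterwards, which is why a presence/absence metric has nothing left to work with.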

Weighted UniFrac also won’t be meaningful, since the data that comes out of percentile normalization isn’t actually an abundance: it’s the percentile at which this OTU in this sample falls relative to the same OTU across all control samples. So using that as an abundance doesn’t really make sense. (But maybe @seangibbons has additional thoughts on this? [Hi Sean!])
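
As a rough illustration of why a percentile is not an abundance, the sketch below converts one case sample to per-OTU percentiles. The `percentile_of` helper and all of the numbers are hypothetical and only approximate the idea; this is not the actual percentile-normalization implementation.

```python
def percentile_of(value, controls):
    """Percent of control values that are <= value (a simple percentile rank)."""
    return 100.0 * sum(1 for c in controls if c <= value) / len(controls)


# Hypothetical relative abundances of two OTUs across five control samples.
controls = {
    "otu_common": [0.30, 0.35, 0.40, 0.45, 0.50],
    "otu_rare": [0.001, 0.002, 0.003, 0.004, 0.005],
}

# One hypothetical case sample: otu_common is far more abundant than otu_rare.
case = {"otu_common": 0.32, "otu_rare": 0.0045}

# Each OTU is ranked only against the same OTU in the controls.
normalized = {otu: percentile_of(case[otu], controls[otu]) for otu in case}

print(normalized)  # {'otu_common': 20.0, 'otu_rare': 80.0}
```

The common OTU ends up with the smaller value after normalization, so within-sample ordering by abundance is lost, and treating these numbers as weights in an abundance-based metric would be misleading.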

Finally, rarefying the percentile-normalized data also won’t work, because the percentiles that come out of the normalization aren’t discrete counts; they are continuous values from 0 to 100. Rarefaction only works with counts, so it’s not applicable here.
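
For anyone unsure why rarefaction needs counts, here is a small sketch of the idea: draw a fixed number of reads without replacement from each sample. The `rarefy` helper is hypothetical and written for illustration only; it is not how QIIME 2 implements rarefaction.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from integer OTU counts."""
    if any(not isinstance(n, int) or n < 0 for n in counts.values()):
        raise ValueError("rarefaction needs non-negative integer counts")
    # Expand the table into one label per read, then draw from that pool.
    reads = [otu for otu, n in counts.items() for _ in range(n)]
    if depth > len(reads):
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)
    return dict(Counter(rng.sample(reads, depth)))


# Hypothetical raw counts for one sample (200 reads in total).
sample_counts = {"otu_a": 120, "otu_b": 30, "otu_c": 50}
rarefied = rarefy(sample_counts, depth=100)
print(sum(rarefied.values()))  # 100

# Hypothetical percentile-normalized values: there are no reads to draw.
percentiles = {"otu_a": 37.5, "otu_b": 80.0, "otu_c": 12.25}
try:
    rarefy(percentiles, depth=100)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```

The subsampling step only makes sense when each value is a number of reads, which is exactly what a percentile between 0 and 100 is not.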

Sorry to be a bummer! Hopefully @seangibbons can provide additional insight into what metrics might be useful, or perhaps others in the QIIME 2 community have found some good ones. 🙂


Thank you so much for the great explanation.

Yup, I agree completely with Claire [Hi Claire!]. Percentile normalization will erase any information on how abundant one OTU is relative to another within a sample, so weighted beta-diversity metrics (like weighted UniFrac or Bray-Curtis) will be less meaningful and difficult to interpret. And the fuzzy-zero issue makes unweighted beta-diversity metrics derived from percentile-normalized data problematic, as Claire described.


Hi @ptalebic,

I think Rob Knight’s group currently uses rarefaction for UniFrac distances, based on the 2017 paper by Weiss et al. Rob developed UniFrac, so while rarefaction is a necessary evil, it’s the current recommendation. An adonis model would let you include sequencing depth as a term, so you could adjust for that in some of your statistics.

If you want to go rarefaction-less, you may want to look at DEICODE, which avoids rarefaction but loses the phylogenetic inference.


