Presence-absence distance method

devonorourke · November 5, 2018, 2:51pm

Thanks to @jwdebelius, @Nicholas_Bokulich, and @Mehrbod_Estaki for their replies.

Apologies for treating Bray-Curtis on binary data as the same thing as a Jaccard calculation. In my defense, I'm an idiot. Actually, my idiocy is rooted in the very point of this post: semantics. I got started a few weeks ago learning about these diversity methods and metrics in R, playing around with the vegan package. I assumed I could feed binary data to a Bray-Curtis metric because R let me do it:

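## in an R environment
library(vegan)
data(dune)  # example community matrix shipped with vegan
# Bray-Curtis on the raw abundances
bray <- vegdist(dune, method = "bray")
# same call, but abundances are first converted to presence/absence
bray_bi <- vegdist(dune, method = "bray", binary = TRUE)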

The two calls give distinct outputs, both of which appeared valid (valid meaning that I got an output rather than an error, not valid in the sense that I had supplied appropriate input!). In fact, another vegan function explicitly mentions that a binary Bray-Curtis calculation is just a Sorensen index - see the very bottom of this document. So it goes to show that semantics matter: someone without any experience with these methods could really make a mess of things!
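To convince myself of that equivalence, here is a minimal sketch in base R. It does not call vegan at all: the toy vectors and the helper functions (bray_curtis, sorensen, jaccard) are my own hypothetical stand-ins, written from the textbook definitions, so treat it as an illustration rather than a description of what vegdist does internally.

```r
# Toy abundance data (hypothetical): two samples, five taxa
x <- c(10, 0, 3, 0, 7)
y <- c(2, 5, 0, 0, 1)

# Bray-Curtis dissimilarity: sum of absolute differences over total abundance
bray_curtis <- function(a, b) sum(abs(a - b)) / sum(a + b)

# Sorensen dissimilarity on presence/absence: 1 - 2 * shared / (n_a + n_b)
sorensen <- function(a, b) {
  pa <- a > 0
  pb <- b > 0
  1 - 2 * sum(pa & pb) / (sum(pa) + sum(pb))
}

# Jaccard dissimilarity on presence/absence: 1 - shared / union
jaccard <- function(a, b) {
  pa <- a > 0
  pb <- b > 0
  1 - sum(pa & pb) / sum(pa | pb)
}

# Convert abundances to 0/1 before applying Bray-Curtis
x_bin <- as.numeric(x > 0)
y_bin <- as.numeric(y > 0)

bc_abund <- bray_curtis(x, y)          # uses the abundances
bc_bin   <- bray_curtis(x_bin, y_bin)  # uses presence/absence only
sor      <- sorensen(x, y)
jac      <- jaccard(x, y)

# Binary Bray-Curtis matches Sorensen, but not Jaccard
print(c(bray = bc_abund, bray_binary = bc_bin, sorensen = sor, jaccard = jac))
```

On this toy pair, the binary Bray-Curtis and Sorensen values coincide, while Jaccard differs from both. The two are still tied together: Jaccard equals 2B / (1 + B), where B is the binary Bray-Curtis value. That is presumably why mixing them up is so easy.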

I still have a few lingering questions though:

As @jwdebelius pointed out, there are a lot of options using non-phylogenetic methods. Likewise, as @Mehrbod_Estaki pointed out, there's a wonderful thread illustrating the alpha and beta metrics available in QIIME. In fact, it appears that UniFrac might even be able to handle presence-absence data, from what I can tell from the gold standard that is a Wikipedia post... Maybe?

My first point, and the point really about semantics all along, is that there isn't any obvious way in QIIME to know which of those are going to work with a binary dataset. It requires ([checks notes]: your own hard work?!?) to ensure you're using the right test for the right reasons. I'm not against requiring users to know what the heck they're doing, but I am in favor of making it easier. I think that was the whole point of the semantics thing in QIIME in the first place, right?

My second point, which is really a tiny feature request, is to incorporate the list that @Mehrbod_Estaki pointed to as an argument to qiime diversity beta or qiime diversity beta-phylogenetic, just like what you have already done with --show-importable-types in qiime tools import. You do already show the available metrics, but I think a parameter like qiime diversity beta --show-metric-info would be a quick way to keep the list of metrics in one place within the program, rather than buried in a thread that may or may not get updated as new distance metrics are added. That parameter would not just show the available metrics, but also include brief summary info (with links to relevant citations) from which users could teach themselves which metric is best for their project. Or maybe I'm just making busy work for someone else ...

Third point: if one does figure out what they're doing (or gets great advice from helpful people like you) and uses a qualitative index like Jaccard, it's perhaps a bit counterintuitive that I would get an error in QIIME when I pass in a binary dataset. I think I know why: it's because QIIME only accepts the semantic type FeatureTable[Frequency]. I would think, though, that it shouldn't be a problem to pass a FeatureTable[PresenceAbsence] input to Jaccard - that seems to me the motivation behind using these semantic types, right? Perhaps it's not useful for reasons I am not considering.

My last question is about bringing distance matrices calculated externally back into QIIME. Is there a way to import a distance matrix as a .tsv file so that I can continue to use QIIME's handy visualizations?

Thanks for all the thoughtful feedback, everyone.