Presence-absence distance method

There’s an example in the Core concepts document, in the section discussing semantic types, pointing out that you could run a quantitative distance method on qualitative data, but you shouldn’t, so QIIME’s semantic types ensure you don’t do something unintended.

I’m interested in using a Bray-Curtis distance calculation on binary (presence-absence) data. The initial feature table was collapsed with qiime feature-table presence-absence, but when I try to incorporate that into qiime diversity beta the error suggests that it’s looking for a table with relative abundances, not just 0’s and 1’s.

Plugin error from diversity:

  Argument to parameter 'table' is not a subtype of FeatureTable[Frequency].

I couldn’t find anything in the forum or documentation that suggested how to use presence-absence data. Usually if I can’t find anyone asking the question, or any documentation, it means I’m making a terribly incorrect assumption about how to use a dataset appropriately!

Nevertheless, in this one case I feel pretty confident this is because most QIIME users are dealing with microbial datasets where relative abundances are an acceptable input for a distance method. However, my dataset is generated from insect COI sequence data, and I am a bit skeptical about using read abundances to make ecological inferences in alpha and beta diversity measures. I’m happy to explore the quantitative approaches, but I have to at least also explore the qualitative inputs using presence-absence data.

If there are any suggestions about how to incorporate presence-absence data into diversity calculations in QIIME I’d love to find out more. Perhaps my only alternative is to export the binary matrix and perform the analyses in something like vegan in R (or an equivalent Python package). If that is the case, it would be great to know if there is a way to compute the distance matrix externally, but still be able to bring that matrix back into QIIME for the qiime diversity beta-group-significance function.

Many thanks!

Hi @devonorourke,

It sounds like you need a different distance metric. Bray-Curtis dissimilarity makes the assumption that you’re dealing with abundance. QIIME 2 currently offers 22 non-phylogenetic metrics, several of which are qualitative metrics, based on presence/absence data.

I've seen Jaccard distance used with some frequency, but I'd suggest looking through the list and seeing if something there works well for you. (You can get the list from the help documentation.)

Best,
Justine

Just to add to @jwdebelius's suggestion, you can find a nice summary of all currently available alpha and beta diversity metrics in qiime2 in this community contribution. Might make things a bit easier.

@devonorourke: @jwdebelius is spot on with this suggestion:

Bray Curtis on presence-absence data pretty much becomes Jaccard. The equations will not produce the same values on identical data, but they will be closely related.

Jaccard distance = 1 - (number of features common to both samples) / (total number of distinct features found in either sample)

Bray-Curtis dissimilarity = 1 - 2 * (sum of the lesser value for each feature common to both samples) / (total feature counts at site A + total feature counts at site B)
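To make the relationship concrete, here is a minimal Python sketch of the two formulas above, applied to two made-up presence-absence samples. The function names and the sample vectors are illustrative assumptions, not part of QIIME 2 or vegan, and the sketch does not handle the case where both samples are empty.

```python
def jaccard_distance(a, b):
    """1 - (features shared by both samples) / (features present in either)."""
    shared = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    either = sum(1 for x, y in zip(a, b) if x > 0 or y > 0)
    return 1 - shared / either


def bray_curtis(a, b):
    """1 - 2 * (sum of per-feature minima) / (total of a + total of b)."""
    lesser = sum(min(x, y) for x, y in zip(a, b))
    return 1 - 2 * lesser / (sum(a) + sum(b))


# Hypothetical presence-absence vectors for two samples over five features.
sample_a = [1, 1, 1, 0, 0]
sample_b = [1, 1, 0, 1, 0]

j = jaccard_distance(sample_a, sample_b)  # shared = 2, either = 4
bc = bray_curtis(sample_a, sample_b)      # lesser = 2, totals = 3 + 3
print(j, bc)
```

On binary data the two values differ but are tied together by BC = J / (2 - J), so both metrics rank pairs of samples in the same order.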

Thanks to @jwdebelius, @Nicholas_Bokulich, and @Mehrbod_Estaki for their replies.

Apologies for confusing Bray-Curtis with binary data as the same thing as a Jaccard calculation. In my defense, I’m an idiot. Actually my idiocy is rooted in the very point of this post - semantics. I got started a few weeks ago learning about these diversity methods and metrics in R, playing around with the vegan package. I was under the assumption I could load binary data into a Bray-Curtis metric because R let me do it:

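## in an R environment
library(vegan)
data(dune)
# Bray-Curtis on the raw abundance values
bray = vegdist(dune, "bray")
# binary=TRUE converts the data to presence-absence before computing the distance
bray_bi = vegdist(dune, "bray", binary=TRUE)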

I get two distinct outputs, both of which appeared valid (valid meaning that I got an output and not an error - not valid in the sense that I had input the data appropriately!). In fact, another vegan function explicitly mentions that a binary Bray-Curtis calculation is just a Sorensen index - see the very bottom of this document. So it goes to show that semantics matter: someone without any experience with these methods could really make a mess of things!

I still have a few lingering questions though:

As @jwdebelius pointed out, there are a lot of options using non-phylogenetic methods. Likewise, as @Mehrbod_Estaki pointed out, there’s a wonderful thread illustrating alpha and beta metrics available in QIIME. In fact, it appears that Unifrac might even be able to handle presence-absence data from what I can tell from the gold standard that is a Wikipedia post… Maybe?

My first point, and the point really about semantics all along, is that there isn’t any obvious way in QIIME to know which of those are going to work with a binary dataset. It requires ([checks notes]: your own hard work?!?) to ensure you’re using the right test for the right reasons. I’m not against requiring users to know what the heck they’re doing, but I am in favor of making it easier. I think that was the whole point of the semantics thing in QIIME in the first place, right?

My second point, which is really a tiny feature request, is to incorporate the list that @Mehrbod_Estaki pointed to within a qiime diversity beta or qiime diversity beta-phylogenetic argument, just like what you have already done with --show-importable-types in the qiime tools import function. You do already show the available metrics, but I think a parameter like qiime diversity beta --show-metric-info would be a quick way to keep the list of metrics available in one place within the program, rather than buried in a thread that may or may not get updated as new distance metrics are added. Adding that parameter would not just show the available metrics, but also include the kind of brief summary info (with links to relevant citations) that a user could follow to teach themselves which test is best for their project. Or maybe I’m just making busy work for someone else :thinking:

Third point: if one does figure out what they’re doing (or gets great advice from helpful people like you) and uses a qualitative index like Jaccard, it’s perhaps a bit counterintuitive why I would get an error in QIIME if I passed in a binary dataset. I think I know why: it’s because QIIME only wants a semantic filetype FeatureTable[Frequency]. I would think though that it shouldn’t be a problem to pass in a FeatureTable[PresenceAbsence] input to Jaccard - that seems to me the motivation behind using these semantic types, right? Perhaps it’s not useful for reasons I am not considering.

My last question is about incorporating distance matrices calculated externally back into QIIME. Is there a way to import a distance matrix as a .tsv file so that I can continue to use QIIME’s handy visualizations?

Thanks for all the thoughtful feedback everyone

You are not wrong — I don't think there is anything wrong with calculating Bray-Curtis dissimilarity on a presence-absence matrix. It's just that it becomes quite similar to Jaccard, so you may as well use Jaccard.

Oh cool, I did not know that one. Jaccard is close, but Sorensen == binary Bray-Curtis! QIIME 2 does have the Sorensen index, listed by its pseudonym, the Dice index.
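For anyone who wants to see why that identity holds, here is a short derivation. The notation is my own, not from the QIIME 2 or vegan documentation: a and b are presence-absence vectors for two samples, and A and B are the sets of features present in each.

```latex
% For binary data, min(a_i, b_i) is 1 only when feature i is in both samples,
% and each row sum is just the number of features present.
\sum_i \min(a_i, b_i) = |A \cap B|, \qquad
\sum_i a_i = |A|, \qquad
\sum_i b_i = |B|

% Substituting into the Bray-Curtis formula gives the Sorensen-Dice dissimilarity.
d_{\mathrm{BC}}(a, b)
  = 1 - \frac{2 \sum_i \min(a_i, b_i)}{\sum_i a_i + \sum_i b_i}
  = 1 - \frac{2\,|A \cap B|}{|A| + |B|}
  = d_{\mathrm{Sorensen}}(A, B)
```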

Yes, that is the point of the semantic types. And we did what we thought made things easiest/simplest: all beta diversity metrics in QIIME 2 can only accept frequency tables, but qualitative metrics convert these to presence-absence data. So whether you are computing Jaccard or Bray-Curtis you input the same data and QIIME 2 does the rest. So there is no need to convert your data to multiple formats, and no need to know which format pairs with which method to use these methods correctly.
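A rough Python sketch of that idea, for illustration only: this is not QIIME 2's actual implementation, and the function names and read counts are made up. The qualitative metric binarizes its input internally, so it gives the same answer for a frequency table and for a table that was already collapsed to presence-absence, while the quantitative metric does not.

```python
def to_presence_absence(counts):
    """Collapse counts to 1 (present) or 0 (absent)."""
    return [1 if c > 0 else 0 for c in counts]


def jaccard_distance(a, b):
    """Qualitative metric: binarize first, then compare feature membership."""
    a, b = to_presence_absence(a), to_presence_absence(b)
    shared = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return 1 - shared / either


def bray_curtis(a, b):
    """Quantitative metric: uses the values exactly as given."""
    lesser = sum(min(x, y) for x, y in zip(a, b))
    return 1 - 2 * lesser / (sum(a) + sum(b))


# Hypothetical read counts for two samples over four features.
counts_a = [10, 0, 3, 1]
counts_b = [2, 5, 0, 1]
binary_a = to_presence_absence(counts_a)
binary_b = to_presence_absence(counts_b)

jaccard_from_counts = jaccard_distance(counts_a, counts_b)
jaccard_from_binary = jaccard_distance(binary_a, binary_b)
bray_from_counts = bray_curtis(counts_a, counts_b)
bray_from_binary = bray_curtis(binary_a, binary_b)

print(jaccard_from_counts, jaccard_from_binary)  # identical
print(bray_from_counts, bray_from_binary)        # different
```

So pre-collapsing the table changes nothing for a metric like Jaccard, but it does change what a metric like Bray-Curtis is measuring.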

But please let us know how we could make this simpler/easier. We are always open to contributions, and if you think that improved documentation would fix this then let's talk :wink:

I have mixed feelings on this. On the one hand, it makes it easier for users to learn about these methods, and resources like that forum post obviously are not too visible. On the other hand, all of this information is easily available online and on Wikipedia. The brief summaries are nice, but can only transmit so much information, certainly not enough to "teach [users] about which test is best for their project". Which boils down to one thing (you called it): this is more time we developers need to spend doing busy work, and less time we get to work on other features and on documentation that is not googleable. What about just linking to something like that forum post from within the method description?

You are right, that error is counterintuitive in this case. We could let the beta method accept a FeatureTable[PresenceAbsence] artifact — that would fix this error but would then allow users to run metrics that may be inappropriate on binary data (inappropriate may not be the correct word, but you see my point — the sword cuts both ways doesn't it!)

A better error message would be useful, but this message is coming from the QIIME 2 framework, which performs type validation, not the plugin. Improving the plugin description to indicate that QIIME 2 does the heavy lifting and that converting to binary is not necessary could be a better solution in my mind, and I can raise an issue if you agree. This would be a great first contribution to QIIME 2 if you are interested :wink: :wink: :wink:
