Adding more FeatureTable types to q2-diversity: how to handle "inappropriate" analyses?

cduvallet · December 12, 2018, 10:33pm

I'm looking into updating the q2-diversity plugin to accept different FeatureTable types (starting with PercentileNormalized but probably also going ahead and doing RelativeFrequency while I'm at it #sharethelove).

In thinking ahead to coding this up, I have a question that'd be great to get developers' thoughts on: how should we deal with the "appropriateness" of different metrics for different data types? I'm specifically thinking of cases where the underlying skbio code works without returning an error, but the calculation itself is inappropriate to be doing on that data.

I can think of a few solutions, ranked from (my guesses on) most to least effort:

1. Change q2-diversity so that types are checked for each alpha and beta metric, not just the overall beta or alpha function calls. I think this would involve registering a new function for each metric type, or would there be another way?
2. Keep the type definition at the broad q2-diversity alpha and beta functions, but do some data/type checks within the function itself and raise an error or warning if we think the wrong data type is being used.
3. Just let users run their feature tables with all metrics that don't throw errors, and put in the documentation somewhere which data types are not appropriate for which metrics.
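To make the second option concrete, here is a minimal sketch of an in-function data check that warns instead of failing. Everything in it is hypothetical: `COUNT_ONLY_METRICS`, `looks_like_counts`, and `check_table_for_metric` are illustrative names, not part of q2-diversity or skbio, and the choice of which metrics count as "count-only" is an assumption.

```python
import warnings

# Hypothetical set of metrics assumed to need integer counts; illustrative only.
COUNT_ONLY_METRICS = {"chao1"}


def looks_like_counts(table):
    """Return True if every value is a non-negative whole number."""
    return all(
        value >= 0 and float(value).is_integer()
        for row in table
        for value in row
    )


def check_table_for_metric(table, metric):
    """Warn (rather than fail) when a count-based metric gets non-count data.

    Returns True if the check passed and False if a warning was issued.
    """
    if metric in COUNT_ONLY_METRICS and not looks_like_counts(table):
        warnings.warn(
            f"{metric!r} is assumed to expect count data, but the table "
            "contains non-integer or negative values.",
            UserWarning,
        )
        return False
    return True


# Toy tables (rows are samples, columns are features).
counts = [[10, 0, 3], [4, 1, 1]]
percentiles = [[0.5, 0.25, 0.9], [0.1, 0.75, 0.3]]
```

A check like this only inspects the values, so it cannot tell a rarefied count table from any other integer-valued table; that limitation is part of why the type-based option is more work but more precise.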

What do y'all think?

And two related questions:

1. I'm actually not super familiar with the theoretically appropriate data types for each underlying method. Do you know if there's any good place to get that info? AKA am I making a big deal out of nothing, and it's actually okay to use any sort of data for basically every metric? Or am I correct in assuming that some metrics absolutely need a certain type of data?
2. I also don't know how scikit-bio handles this. From their documentation, it seems that they expect feature tables with count data (i.e. Frequency data type, numbers that can be cast to integers). But then looking specifically at the beta diversity doc, it seems that either count or abundance data is okay? Either way, in my experience, throwing in other data types seems to work fine. For example, @seangibbons' recent workshop (slide 69 of the pdf) hackily converted a feature table actually containing PercentileNormalized data into one with a Frequency data type, and then used that feature table to make a PCoA plot based on Bray-Curtis distance. The PCoA looks good and the code ran without error, so this is obviously a possibility with the underlying code. Do you know whether and how skbio handles this issue?

thermokarst · December 14, 2018, 2:13am

Hey there @cduvallet! A handful of the core development team are teaching QIIME 2 workshops over the next two weeks, so it might be a while before you get a satisfactory response from someone. Sorry! Just wanted to let you know that we have this one in the queue.

ebolyen · January 8, 2019, 4:45pm

Hey @cduvallet,

Happy New Year, and sorry this took so long.

I think you've really hit the nail on the head here, and we're actually in the process of implementing the first option:

@ChrisKeefe is going to be creating a new plugin that defines granular actions for each of these metrics, which will allow each one to have its own type signature. We'll then be converting alpha/beta into pipelines, which will do what you suggest as your second point.
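As a rough illustration of "each metric has its own type signature" (a plain-Python sketch, not the actual plugin code or the QIIME 2 registration API): `ACCEPTED_TYPES`, `validate_metric_input`, and `run_metric` are hypothetical names, and the mapping of metrics to accepted types is a placeholder rather than a recommendation.

```python
# Hypothetical mapping from metric name to the table types it accepts.
# The type names mirror those mentioned in this thread; which metric accepts
# which type is a placeholder, not a recommendation.
ACCEPTED_TYPES = {
    "chao1": {"Frequency"},
    "braycurtis": {"Frequency", "RelativeFrequency", "PercentileNormalized"},
}


def validate_metric_input(metric, table_type):
    """Raise if `table_type` is not in the metric's own signature."""
    if metric not in ACCEPTED_TYPES:
        raise ValueError(f"Unknown metric: {metric!r}")
    if table_type not in ACCEPTED_TYPES[metric]:
        accepted = ", ".join(sorted(ACCEPTED_TYPES[metric]))
        raise TypeError(f"{metric!r} accepts {accepted}; got {table_type!r}.")


def run_metric(metric, table_type):
    """Stand-in for a pipeline: check the signature, then dispatch."""
    validate_metric_input(metric, table_type)
    return f"running {metric} on {table_type}"
```

With this shape, `run_metric("chao1", "RelativeFrequency")` fails before any computation happens, while the same table type passes through for a metric whose signature lists it.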

On the theoretically appropriate data types for each metric, I'm in this camp myself: there's a slew of normalization techniques which become more or less appropriate, but I'm also fuzzy on the specifics.

Some metrics/measures absolutely need certain kinds of input. For example, computing Chao1 without singletons/doubletons is a pointless exercise. But something like Bray-Curtis may be more flexible with respect to the kinds of inputs that make sense.
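The Chao1 point can be seen in a small sketch. This assumes the bias-corrected form of the estimator, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of singletons and doubletons; the function below is written out by hand for illustration and is not the skbio implementation.

```python
def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Toy sample: three singletons, one doubleton, two more abundant features.
raw_counts = [1, 1, 1, 2, 5, 10]

# The same sample as relative frequencies: no value equals 1 or 2 any more.
total = sum(raw_counts)
relative = [c / total for c in raw_counts]

print(chao1(raw_counts))  # 7.5
print(chao1(relative))    # 6.0
```

On the count data the estimator adds 1.5 unseen features to the 6 observed ones. On the relative-frequency version the formula still runs without error, but with no singletons or doubletons it silently collapses to the observed richness.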

I personally hope we can start addressing this with semantic properties which could encode the more refined details like "contains singletons", or "stabilizes variance". This all requires TypeMap to let us propagate that information. But once that is finished, I think we'll be in a good position to start answering these kinds of questions, and encoding those assumptions in a formal way.

cduvallet:

I also don’t know how scikit-bio handles this. From their documentation, it seems that they expect feature tables with count data (i.e. Frequency data type, numbers that can be cast to integers). But then looking specifically at the beta diversity doc, seems that either count or abundance data is okay? Either way, in my experience, throwing in other data types seems to work fine – for example, @seangibbons’ recent workshop (slide 69 of the pdf) hackily converted a feature table actually containing PercentileNormalized data into one with a Frequency data type, and then used that feature table to make a PCoA plot based on Bray-Curtis distance. The PCoA looks good and the code ran without error, so this is obviously a possibility with the underlying code. Do you know whether and how skbio handles this issue?

My recollection is that skbio uses numpy/scipy directly and so it doesn't generally do any type-coercion, with the exception of qualitative metrics/measures. So whatever the formula is, it will be applied without any special regard for the machine format of the number.
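A hand-written sketch of what that means in practice (these are the textbook formulas in plain Python, not skbio's code, and the toy vectors are made up): a quantitative metric like Bray-Curtis just applies its formula to whatever numbers it gets, while a qualitative metric like Jaccard first reduces the values to presence/absence.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


def jaccard(u, v):
    """Jaccard distance on presence/absence: values are reduced to > 0."""
    present_u = [a > 0 for a in u]
    present_v = [b > 0 for b in v]
    union = sum(1 for a, b in zip(present_u, present_v) if a or b)
    shared = sum(1 for a, b in zip(present_u, present_v) if a and b)
    return 1 - shared / union


# Toy samples as integer counts and as fractional values.
counts_a, counts_b = [10, 0, 5], [5, 5, 0]
percentile_a, percentile_b = [0.5, 0.0, 0.25], [0.25, 0.25, 0.0]

print(bray_curtis(counts_a, counts_b))
print(bray_curtis(percentile_a, percentile_b))
print(jaccard(counts_a, counts_b))
print(jaccard(percentile_a, percentile_b))
```

Neither function checks whether the inputs are counts, which matches the observation that a PercentileNormalized table relabeled as Frequency runs without error. Whether the resulting number is meaningful is a separate question that the arithmetic cannot answer.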

Hope that's helpful!

cduvallet · January 14, 2019, 9:11pm

Sounds great! Once that's done, I can go back and add in PercentileNormalized wherever it makes sense to.