I’m looking into updating the q2-diversity plugin to accept different FeatureTable types (starting with PercentileNormalized, but probably also RelativeFrequency while I’m at it #sharethelove).
In thinking ahead to coding this up, I have a question that’d be great to get developers’ thoughts on: how should we deal with the “appropriateness” of different metrics for different data types? I’m specifically thinking of cases where the underlying skbio code works without returning an error, but the calculation itself is inappropriate to be doing on that data.
I can think of a few solutions, ranked by my guess at effort, from most to least:
- Update q2-diversity so that types are checked for each alpha and beta metric, not just at the overall alpha or beta function calls. I think this would involve registering a new function for each metric type, or is there another way?
- Keep the type definition at the broad beta functions, but do some data/type checks within the function itself and raise an error or warning if we think the wrong data type is being used.
- Just let users run their feature tables with all metrics that don’t throw errors, and put in the documentation somewhere which data types are not appropriate for which metrics.
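To make the second option concrete, here is a minimal sketch of what an in-function check could look like. Everything in it is hypothetical: the names `COUNT_ONLY_METRICS`, `looks_like_counts`, and `check_metric_input` are made up for illustration and are not part of q2-diversity or skbio, and the set of "count-only" metrics is a placeholder, since which metrics really need counts is exactly the open question.

```python
import warnings

# Placeholder set: which metrics truly require count data is an assumption here.
COUNT_ONLY_METRICS = {"chao1"}


def looks_like_counts(table):
    """Return True if every value is a non-negative whole number."""
    return all(
        value >= 0 and float(value).is_integer()
        for row in table
        for value in row
    )


def check_metric_input(table, metric, strict=False):
    """Warn (or raise, if strict) when a count-only metric gets non-count data.

    Returns True when the data passes the check, False when a warning is issued.
    """
    if metric in COUNT_ONLY_METRICS and not looks_like_counts(table):
        message = (
            f"Metric '{metric}' is flagged as count-only, "
            "but the table contains non-integer or negative values."
        )
        if strict:
            raise ValueError(message)
        warnings.warn(message, UserWarning)
        return False
    return True


counts = [[10, 0, 3], [4, 7, 1]]
proportions = [[0.5, 0.25, 0.25], [0.1, 0.6, 0.3]]

check_metric_input(counts, "chao1")       # passes silently
check_metric_input(proportions, "chao1")  # emits a UserWarning
```

The `strict` flag is one way to let the plugin choose between a hard error and a warning. Note that a value-based check like this cannot tell a PercentileNormalized table from any other non-integer table; only the semantic type can do that.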
What do y’all think?
And two related questions:
- I’m actually not super familiar with the theoretically appropriate data types for each underlying method. Do you know if there’s any good place to get that info? AKA am I making a big deal out of nothing, and it’s actually okay to use any sort of data for basically every metric? Or am I correct in assuming that some metrics absolutely need a certain type of data?
- I also don’t know how scikit-bio handles this. From their documentation, it seems that they expect feature tables with count data (i.e. Frequency data type, numbers that can be cast to integers). But looking specifically at the beta diversity doc, it seems that either count or abundance data is okay? Either way, in my experience, throwing in other data types seems to work fine. For example, @seangibbons’ recent workshop (slide 69 of the pdf) hackily converted a feature table actually containing PercentileNormalized data into one with a Frequency data type, and then used that feature table to make a PCoA plot based on Bray-Curtis distance. The PCoA looks good and the code ran without error, so the underlying code clearly allows this. Do you know whether and how skbio handles this issue?
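As a small illustration of why "it ran without error" doesn't settle the question, here is a hand-written Bray-Curtis dissimilarity using the standard formula (sum of absolute differences divided by the sum of all values). This is my own sketch, not skbio's implementation, and `bray_curtis` is a made-up name. The arithmetic works for any non-negative numbers, so nothing in the formula itself can flag an inappropriate data type.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two equal-length abundance vectors."""
    if len(u) != len(v):
        raise ValueError("Vectors must have the same length.")
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    if denominator == 0:
        raise ValueError("Both vectors are all zeros.")
    return numerator / denominator


# Count data (Frequency-like)
print(bray_curtis([1, 2, 3], [3, 2, 1]))

# Proportions (RelativeFrequency-like): runs just as happily
print(bray_curtis([0.5, 0.5], [0.25, 0.75]))  # 0.25
```

When both samples sum to 1, the denominator is always 2, so the result is half the summed absolute difference. Whether that quantity is meaningful for a given normalization is a separate question from whether the code runs.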