Recommended type for features - reads per kilo base?

John_Chase · December 6, 2022, 5:49pm

Hello,

I am working on updating a plugin for Humann3. The output of Humann3 is a set of tables that contain data in reads per kilobase.

I am trying to find the correct importable type to specify for this data as output from the plugin, e.g. FeatureData[Frequency]. The data should be feature data, but I cannot find a specific type for the values themselves. Being reads per kilobase, the values are floats > 0.

It may be the case that there is not a default type defined, but I wanted to ask here before creating a new type.

Thanks

gregcaporaso · December 7, 2022, 3:58pm

Hey @John_Chase,
I landed on FeatureTable[Frequency] for this in q2-sapienns (which consumes Humann3 output files for use in QIIME 2). This makes sense from my perspective - the data are similar in nature to rarefied feature tables (in that some normalization has been applied). It is different than our typical FeatureTable[Frequency] in that the values are floats rather than ints, but I don't think that causes an issue.

Do you think FeatureTable[Frequency] doesn't make sense for this? Interested in what you or others think.

John_Chase · December 7, 2022, 10:32pm

Thanks @gregcaporaso, this is what I was thinking originally, though I was under the impression that Frequency data had to be whole numbers, which is in fact not the case.

My concern would be the situation where some statistical function expects ints and, rather than failing, silently works poorly with float data. I am not familiar enough with existing methods to say that this would in fact be the case.

gregcaporaso · December 9, 2022, 3:45pm

@John_Chase, I agree, that's a valid concern. I'm not certain that this is the best solution for that reason, but it's where I started.

I was hesitant to create a new type because I didn't know if/where this would be a problem, and I didn't want to proliferate types unnecessarily (the values do still seem to me to fit the definition of "Frequency"). A new type would also require updating a lot of the actions I'd want to use it with throughout the q2-plugin-verse so that they accept it, which (to be honest) seemed like a lot of work if I wasn't sure it was needed.

Another way to handle this that may be safer would be to round values to whole numbers. In this case, it probably makes more sense to always round up (i.e., np.ceil - similar to what we do in feature-table group when we take the mean or the median) so that small values don't get rounded down to zero.
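
As a rough sketch of the rounding idea (not code from q2-sapienns or feature-table group), here is how np.ceil compares with rounding to the nearest whole number on a small made-up reads-per-kilobase matrix. The array values and variable names are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical reads-per-kilobase values: rows are samples, columns are features.
rpk = np.array([
    [0.0, 12.0],
    [0.2, 0.0],
    [3.7, 0.01],
])

# Always rounding up keeps every nonzero value at 1 or more,
# while true zeros stay zero.
rounded_up = np.ceil(rpk)

# Rounding to the nearest whole number would turn the small
# nonzero values (0.2 and 0.01) into zeros.
rounded_nearest = np.rint(rpk)

# Count how many observed (nonzero) values each approach preserves.
print(np.count_nonzero(rpk), np.count_nonzero(rounded_up), np.count_nonzero(rounded_nearest))
```

Note that np.ceil returns floats holding whole-number values, so an explicit cast (for example with astype(int)) would still be needed if a downstream step strictly requires an integer dtype.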