Add more FeatureTable data types to q2-sample-classifier?

cduvallet · August 15, 2018, 6:22pm

I'm starting to go through and add the percentile normalized data type (FeatureTable[PercentileNormalized]) to the relevant functions where it should be an allowable input, and noticed that q2-sample-classifier currently only accepts FeatureTable[Frequency] as input. (L83 in plugin_setup.py)

Philosophically speaking, machine learning classifiers should accept any type of feature data (presence/absence, relative abundance, etc). Practically speaking, will this break the underlying sklearn code at any point? I don't think it should (though it might produce weird results, especially in regression), but wanted to get another brain on this before submitting the PR.

Here's the commit with my proposed edits to plugin_setup.py.

Nicholas_Bokulich · August 15, 2018, 6:57pm

Some algorithms (e.g., SVMs, I believe) can be very sensitive to different normalization techniques; others (e.g., random forests) are generally pretty robust. So FeatureTable[Frequency] is a safer default until we can think through whether all the various table formats are appropriate inputs.

That said, I am also happy to open this up to all formats and let users decide for themselves what to do, provided we test this.

I think not, but I suppose it's possible.

that pretty much encapsulates my concerns — do not want to open it up until tested.

Frequency, RelativeFrequency, PresenceAbsence all definitely make sense.
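For reference, here is a minimal sketch of how the other two tables relate to a raw frequency table. It assumes rows are samples and columns are features; the function names and the toy 3x3 table are hypothetical and are not taken from q2-sample-classifier or q2-feature-table.

```python
def to_relative_frequency(table):
    """Scale each sample (row) so that its values sum to 1."""
    result = []
    for row in table:
        total = sum(row)
        if total == 0:
            raise ValueError("cannot normalize a sample with zero total count")
        result.append([count / total for count in row])
    return result


def to_presence_absence(table):
    """Replace every non-zero count with 1 and every zero with 0."""
    return [[1 if count > 0 else 0 for count in row] for row in table]


# Toy frequency table: 3 samples (rows) x 3 features (columns).
frequency = [
    [10, 0, 30],
    [5, 5, 0],
    [0, 2, 8],
]

relative = to_relative_frequency(frequency)
presence = to_presence_absence(frequency)
```

Both derived tables are non-negative and have the same shape as the input, which is part of why they look like safe inputs for most estimators.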

Balance I am not so sure about — I suppose we can find out!

Composition is unnecessary.

PercentileNormalized: you will need to tell me if it's appropriate. If there are negative values, I'm not sure if/how some estimators might behave.
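On the negative-values question, a rough sketch may help. It assumes percentile normalization means expressing each case value as a percentile of the control distribution for the same feature; under that assumption the output lies in [0, 100]. The helper names are hypothetical, and the real plugin may treat zeros and ties differently.

```python
def percentile_of_score(reference, value):
    """Percentage of reference values that are <= value."""
    if not reference:
        raise ValueError("reference distribution must not be empty")
    at_or_below = sum(1 for ref in reference if ref <= value)
    return 100.0 * at_or_below / len(reference)


def percentile_normalize(control_values, case_values):
    """Express each case value as a percentile of the control values.

    Simplified assumption: one feature at a time, no special handling
    of zeros.
    """
    return [percentile_of_score(control_values, v) for v in case_values]


# Toy relative abundances of a single feature.
controls = [0.1, 0.2, 0.3, 0.4]
cases = [0.05, 0.25, 0.9]

normalized = percentile_normalize(controls, cases)
# normalized == [0.0, 50.0, 100.0]
```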

You can go ahead and submit that PR — and we can move further conversation over to the review. But I'll tell you now: it would be good to get some unit tests in there just to make sure these formats work. Could you construct some minimal feature tables (e.g., 3X3) for PercentileNormalized, RelativeFrequency, and PresenceAbsence and slap together a basic unit test that trains a classifier and regressor from each?
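Something along these lines is what I have in mind. It is only a sketch that calls scikit-learn directly rather than going through the q2-sample-classifier actions; the table values, labels, and function names are made up for illustration. It only checks that fitting and predicting do not raise, not that the results are meaningful.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Hypothetical 3x3 tables (rows = samples, columns = features).
TABLES = {
    "RelativeFrequency": np.array(
        [[0.25, 0.0, 0.75], [0.5, 0.5, 0.0], [0.0, 0.2, 0.8]]
    ),
    "PresenceAbsence": np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1]]),
    "PercentileNormalized": np.array(
        [[0.0, 50.0, 100.0], [25.0, 75.0, 50.0], [100.0, 0.0, 25.0]]
    ),
}
CLASS_LABELS = np.array(["a", "b", "a"])
NUMERIC_TARGETS = np.array([1.0, 2.0, 3.0])


def fit_and_predict(estimator, table, targets):
    """Fit on the table and predict on the same samples (smoke test only)."""
    estimator.fit(table, targets)
    return estimator.predict(table)


def smoke_test_all_tables():
    """Train one classifier and one regressor per table type."""
    results = {}
    for name, table in TABLES.items():
        classifier = RandomForestClassifier(n_estimators=5, random_state=0)
        regressor = RandomForestRegressor(n_estimators=5, random_state=0)
        results[name] = (
            fit_and_predict(classifier, table, CLASS_LABELS),
            fit_and_predict(regressor, table, NUMERIC_TARGETS),
        )
    return results


results = smoke_test_all_tables()
```

In the actual PR the same idea would go through the plugin's own test fixtures, and other estimators (e.g., SVMs) could be added to the loop.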

cduvallet · August 15, 2018, 8:43pm

Right, but I think it's probably outside of the scope of QIIME 2 to address these sorts of issues. If we try, I think we'll almost certainly fail to be totally exhaustive and/or keep up with different types of normalizations, so we might as well leave this up to the users to make (hopefully informed) decisions about.

I'm less familiar with Balance - is this the one where OTU abundances are converted to ratios of abundances between two parts of the tree? I wonder if this will introduce weird correlated features that would violate assumptions of some ML classifiers...? Anyone we can ask who would know?

Definitely appropriate!

What do you mean by "work"? Do you mean that they don't throw errors when you put them into sklearn, or that they produce meaningful results? But yes, that sounds fine to do!

Ok, will remove Composition and submit the PR.

Nicholas_Bokulich · August 15, 2018, 10:10pm

agreed, we can't control all of these situations, though that's sort of what the semantic types are meant to address, so limiting formats that just don't ever make sense (e.g., maybe Balances???) is responsible.

@mortonjt — do you have any thoughts on this?

At least that they don't throw errors — meaningful results would be nice but that is out of scope here I think. We just want to make sure that the different formats are compatible with different estimators. I will follow up in that PR with some thoughts on this.

Thank you!

mortonjt · August 17, 2018, 7:42pm

wrt the correlated features, it's the other way around: the features are already correlated since they are proportions. The question is how one can properly account for the sample space. The ilr actually removes these sorts of constraints.
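To make that concrete, here is a small sketch of a single balance using the standard ilr formula: the scaled log ratio of the geometric means of two groups of parts. The function names and toy numbers are made up, and this is not the gneiss or philr implementation.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def balance(numerator, denominator):
    """ilr balance between two groups of strictly positive parts."""
    r, s = len(numerator), len(denominator)
    scale = math.sqrt(r * s / (r + s))
    return scale * math.log(
        geometric_mean(numerator) / geometric_mean(denominator)
    )


# Proportions sum to 1, so they cannot vary independently of each other.
proportions = [0.1, 0.2, 0.3, 0.4]

# One balance: features 0-1 against features 2-3. It can take any real value.
b = balance(proportions[:2], proportions[2:])

# The same balance computed from unnormalized counts: the total cancels out.
b_counts = balance([1, 2], [3, 4])
```

A balance depends only on ratios, so it is unchanged by rescaling the sample, and a full ilr of D parts yields D-1 such coordinates with no sum constraint.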

philr has some benchmarks comparing different machine learning techniques against ilr transformed data, and it seems to boost accuracy: "A phylogenetic transform enhances analysis of compositional microbiota data" (eLife).

The biggest thorn for these sorts of approaches is properly dealing with zeros, which can still be tricky. But there are multiple legit ways around this; happy to discuss if there is interest.
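As one illustration of the zero problem, the sketch below uses a pseudocount. That choice is an assumption for the example, being only one common workaround and not necessarily the recommended one; the function names and counts are hypothetical.

```python
import math


def add_pseudocount(counts, pseudocount=1):
    """Add a small constant to every count so that no value is zero."""
    return [count + pseudocount for count in counts]


def closure(values):
    """Rescale values so that they sum to 1."""
    total = sum(values)
    return [value / total for value in values]


counts = [0, 12, 30]

# The log of a zero count is undefined, so any log ratio involving it fails.
try:
    math.log(counts[0])
    zero_fails = False
except ValueError:
    zero_fails = True

# After adding the pseudocount, every log ratio is finite.
adjusted = closure(add_pseudocount(counts))
log_ratio = math.log(adjusted[0] / adjusted[1])
```

The size of the pseudocount influences the resulting log ratios, which is one reason zero handling stays tricky.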