Feature reduction based on model importance in classify-samples-ncv vs. sample-classifier

jrw187 · March 29, 2021, 11:18pm

I am curious about the options for feature reduction when using classify-samples-ncv vs. classify-samples. With classify-samples (random forest), you can use --p-optimize-feature-selection to enable recursive feature elimination, which helps with dimensionality reduction and downstream analysis. This is not an option with classify-samples-ncv. So how would you go about doing something similar in a statistically justifiable way?

I have a dataset that I ran classify-samples-ncv (random forest) on. The accuracy was pretty good, but I wanted to see what would happen if I reduced the number of features. So I used the importance .qza artifact to reduce the table to the top "X" features and re-ran the classifier. I repeated this, reducing "X" by 5 features each time. As I did so, model accuracy improved up to a point, plateaued, and then began to decline. I would say that is where I would want to cut things off. But I am pretty sure that this is not the proper approach to feature reduction. Is there some option in classify-samples-ncv that I am missing that would approximate this approach, or what recursive feature elimination achieves in classify-samples?

Thank you!

adamova · March 31, 2021, 10:56am

Dear @jrw187,

Your impression is correct: classify-samples-ncv currently has no built-in way to perform feature reduction directly (as classify-samples does).

If you perform feature elimination with the results returned from classify-samples-ncv, you have a data leakage problem. More precisely, the feature importances returned by classify-samples-ncv are averaged across all nested cross-validation folds, so every sample has contributed to them. If you select features from these importances and then re-run the classifier, you are not strictly keeping the training and test sets separate.
So, when using classify-samples-ncv, there is currently no correct way of performing feature elimination. We will need to implement the feature selection option within the inner loop of the nested cross-validation in classify-samples-ncv. I have opened a new issue on q2-sample-classifier so that this option becomes available in the future.
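
To make the leakage concrete, here is a minimal sketch in plain Python. It is not QIIME 2 code and does not use a random forest: the data are pure noise, the classifier is a simple nearest-centroid rule, and all names (`top_features`, `cv_accuracy`, and so on) are made up for illustration. It only shows that ranking features on all samples before cross-validating tends to inflate accuracy, while ranking inside each training fold does not.

```python
import random

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def class_means(X, y, rows, j):
    """Per-class mean of feature j, computed over the given rows only."""
    return (mean(X[i][j] for i in rows if y[i] == 0),
            mean(X[i][j] for i in rows if y[i] == 1))

def top_features(X, y, rows, k):
    """Indices of the k features with the largest class-mean gap on `rows`."""
    def gap(j):
        m0, m1 = class_means(X, y, rows, j)
        return abs(m0 - m1)
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def fold_accuracy(X, y, train, test, features):
    """Nearest-centroid rule: centroids from `train`, accuracy on `test`."""
    centroids = [class_means(X, y, train, j) for j in features]
    correct = 0
    for i in test:
        d0 = sum((X[i][j] - c[0]) ** 2 for j, c in zip(features, centroids))
        d1 = sum((X[i][j] - c[1]) ** 2 for j, c in zip(features, centroids))
        correct += (0 if d0 <= d1 else 1) == y[i]
    return correct / len(test)

def cv_accuracy(X, y, k, n_folds, select_inside_folds):
    """Cross-validated accuracy with feature selection inside or outside the folds."""
    rows = list(range(len(X)))
    folds = [rows[f::n_folds] for f in range(n_folds)]
    # Leaky variant: rank features once, using ALL samples (test samples included).
    shared = None if select_inside_folds else top_features(X, y, rows, k)
    scores = []
    for test in folds:
        held_out = set(test)
        train = [i for i in rows if i not in held_out]
        # Honest variant: rank features using the training samples of this fold only.
        features = top_features(X, y, train, k) if select_inside_folds else shared
        scores.append(fold_accuracy(X, y, train, test, features))
    return mean(scores)

rng = random.Random(0)
n_samples, n_features = 60, 1000
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]  # labels are unrelated to X by construction

leaky_acc = cv_accuracy(X, y, k=10, n_folds=5, select_inside_folds=False)
honest_acc = cv_accuracy(X, y, k=10, n_folds=5, select_inside_folds=True)
print(f"selection on all samples:   {leaky_acc:.2f}")
print(f"selection inside each fold: {honest_acc:.2f}")
```

Because the labels carry no signal, an honest estimate should sit near chance. Any accuracy well above that in the first number comes purely from letting the test samples influence which features were kept.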

That said, the importance scores from classify-samples-ncv can be used to identify the most predictive features. I suggest you revert to using classify-samples (which has a fixed train/test split) when you need a model trained with an optimised number of features. If you are only interested in knowing the most predictive features, you can use classify-samples-ncv as is.
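
For readers who want to see the fixed-split idea outside of QIIME 2, here is a rough scikit-learn sketch. It is an illustration under assumptions, not the plugin's implementation: the data are synthetic (`make_classification`), and `RFECV` stands in for recursive feature elimination. The number of features is chosen by cross-validation on the training set only, dropping 5 features per step, and the held-out test set is scored once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for a feature table: 120 samples, 40 features, 5 informative.
X, y = make_classification(n_samples=120, n_features=40, n_informative=5,
                           n_redundant=0, random_state=0)

# Fixed train/test split; the test set is not touched during feature selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Recursive feature elimination, cross-validated within the training set only.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    step=5,                      # drop 5 features per elimination round
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="accuracy",
    min_features_to_select=5,
)
selector.fit(X_train, y_train)

n_selected = selector.n_features_                 # feature count chosen by CV
test_accuracy = selector.score(X_test, y_test)    # single evaluation on held-out data
print(f"features kept: {n_selected}, test accuracy: {test_accuracy:.2f}")
```

The point of the design is that the accuracy plateau is located using training data alone, so the final test score is not biased by the choice of how many features to keep.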

I hope this helps. Please let me know if anything remains unclear.
