Importing trained sample-classifier in python/R in order to access feature importance "orientation"

ncaramello · November 2, 2020, 12:57pm

Dear all,

I used regress-samples (with RandomForest as the estimator) in order to identify the features contributing the most to a given Behavioral Score (Time spent in social contact).

"
qiime sample-classifier regress-samples
--i-table $data-social-table.qza
--m-metadata-file metadata.tsv
--p-test-size 0.05
--p-n-jobs 4
--m-metadata-column Time_in_Social_Contact
--p-estimator RandomForestRegressor
--p-random-state 123
--p-optimize-feature-selection
--output-dir $data-social-TimeSC
"

Then I took a peek at FeatureImportance, which was almost exactly what I was looking for. The only issue I have is that FeatureImportance does not specify whether a feature is contributing towards "high" or "low" "time spent in social contact". In other words, whether the coefficient for that feature is positive or negative.

I know that with categorical data I should be able to extract a vector of "feature contribution" with treeinterpreter from the RandomForest directly in Python.
I was hoping to do just that here.

Thus, I exported the Sample-estimator
"
qiime tools export
--input-path $data-social-TimeSC/sample-classifier.qza
--output-path $data-social-TimeSC/class
"

This produced a pipeline_sklearn.pkl which I should be able to import in Python.
However, every attempt at importing the pickled pipeline produced the following error :

(apologies for the screencap, i do not quite know how to format code on the forums).

I know so far that it isn't due to scikit-learn version as I have installed scikit-learn=0.23.1 which Qiime2 is using.
I can't seem to find a way to solve this issue on the qiime2 forums nor in scikit-learn related forums.

Does anyone know how to solve the import issue ?
Or otherwise, how to access the "orientation" of feature importance ?

The only solution I can think of is actually running the RandomForest classifier outside of Qiime and directly in R or python, which is much less convenient.

This is quite weird and I feel as though I am missing an obvious piece of the problem. Accessing the "sign" of "RandomForest regressor coefficients" seems like a basic request and I am puzzled not to find any related topics to that on the forums. So apologies if the answer is obvious.

As a side note, I would like to thank the mods and the community for the constant technical support.
This is the first time that I ever have a question for which I can't find an answer on the forums.

adamova · November 3, 2020, 10:59am

Hi @ncaramello,

To read the RandomForest regressor you can do the following:

import joblib

# to load sklearn pipeline
loaded_pipeline = joblib.load('sklearn_pipeline.pkl')

# extract estimator from pipeline (produces output of type "sklearn.ensemble._forest.RandomForestRegressor")
loaded_model = loaded_pipeline['est']

Then you can use loaded_model directly in Python with treeinterpreter.
As you correctly noticed, getting a feature "sign" information in a non-linear model like RandomForest is not straightforward. Which is why there are many methods available to analyse the impact of a feature on a complex model (treeinterpreter, SHAP, LIME, partial dependency plots, ...).
FYI: The feature importances outputted by RandomForestRegressor in ScikitLearn (and with that also in qiime sample-classifier regress-samples) are "Gini imporance" scores which per definition do not contain a "sign" information (nice explanation of "Gini imporance" score here).

As for getting an insight into features from the trained regressor directly in QIIME2, there is the option of visualising the abundances of the top features in each sample grouped by the target. By binning your continuous target into discrete categories ("Time_in_Social_Contact_binned"), you can generate an abundance heatmap with:

qiime sample-classifier heatmap \
  --i-table data-social-table.qza \
  --i-importance feature_importance.qza \
  --m-sample-metadata-file metadata.tsv \
  --m-sample-metadata-column Time_in_Social_Contact_binned \
  --p-group-samples \
  --o-filtered-table important-feature-table.qza \
  --o-heatmap feature-heatmap.qzv

If you are less interested in the model's performance and more in the interpretability on the feature's influence on the target, you could also consider training a simpler model that only captures linear effects, such as Linear Regression. Here the interpretation of the features' influence (and "sign") is straightforward.

Hope this helps.

ncaramello · November 3, 2020, 11:49am

Hello @adamova,

Thank you very much for your answer !
Obviously I should have tried joblib as well as pickle.
It works like a charm.

FYI: The feature importances outputted by RandomForestRegressor in ScikitLearn (and with that also in qiime sample-classifier regress-samples ) are “Gini imporance” scores which per definition do not contain a “sign” information (nice explanation of “Gini imporance” score here).

Thank you for the explanation. I had a hunch that my simplistic idea of "coefficients" was far from the truth.
I had hoped to extract the average contribution across all trees in the forest from my trained classifier using treeinterpreter (for each feature).
Thanks to your explanations, I am now beginning to understand how needlessly convoluted this approach is. Especially considering that I don't care about model performance.
All in all, I am still more interested in extracting features that impact my target variable the most. Interpretability of the feature's influence is icing on the cake.
And I am afraid that a linear regression would be "skewed" by the very low sample/feature ratio.
Thus I am thinking of training a penalized linear regression model as an adequate middle ground.
That would allow me to retain the ability to "separate the wheat from the chaff" while outputting straightforward "coefficients" for each feature.
Do you think this is a relevant approach ?

In any case, thank you so much for your time and your detailed answer, I now understand RandomForest a good deal better.

adamova · November 3, 2020, 1:22pm

Hi @ncaramello,

Happy to hear that the answer was helpful .

Using penalized linear regression sounds like a reasonable approach. By using Lasso (L1) instead of Ridge (L2) regularisation, you will not only avoid overfitting but also get a restricted amount of relevant model features (somewhat performing feature selection while fitting the model).

Just be aware, that by using linear regression your model can only capture linear effects of features on the target variables. To get a feeling of how much predictive power is lost by that, I advise you to compare the model performances of the linear regression with the RandomForest model.

Have fun.

thermokarst · November 3, 2020, 2:18pm

Thanks for this great post, @adamova!

@ncaramello, just to tie this together with your earlier efforts that involved exporting data etc, here is a solution that uses the QIIME 2 Artifact API (the python interface, which will streamline your efforts a bit):

import qiime2
from sklearn.pipeline import Pipeline

classifier_artifact = qiime2.Artifact.load('sample-classifier.qza')
pipeline = classifier_artifact.view(Pipeline)
loaded_model = pipeline['est']

The classifier_artifact.view(Pipeline) uses a QIIME 2 transformer to handle the conversion for you.

:qiime2:

system · December 4, 2020, 11:45pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.