How does qiime2 deal with unseen or missing features when predicting new samples with a pre-trained RF model?

hjsung13 · June 13, 2022, 6:13pm

Hi,

I'm currently working with the q2-sample-classifier package (qiime2 2021-11 version, Python 3.8.12) to classify samples by their ASV features.

I trained a sample estimator (RandomForestClassifier) on the Discovery cohort, and used the trained sample_estimator.qza to predict the disease status of the Validation cohort.

*The number of ASV features differs considerably between the Discovery cohort and the Validation cohort.

The commands below ran without problems in the qiime2 environment.

$ qiime sample-classifier classify-samples \
    --i-table Discovery_cohort_ASV_table.qza \
    --m-metadata-file Metadata.tsv \
    --m-metadata-column Disease \
    --p-test-size 0.3 \
    --p-cv 10 \
    --p-random-state 1 \
    --p-n-estimators 100 \
    --p-parameter-tuning \
    --p-optimize-feature-selection \
    --output-dir Randomforest_result \
    --p-n-jobs 8

$ qiime sample-classifier predict-classification \
    --i-table Validation_cohort_ASV_table.qza \
    --i-sample-estimator sample_estimator.qza \
    --p-n-jobs 8 \
    --output-dir validation_result

However, as far as I know, sklearn's RandomForestClassifier cannot predict on an ASV table that contains unseen features or lacks features seen during training.

In fact, when I trained the estimator directly with sklearn (ver. 1.1.1) in Python (ver. 3.10.4), the trained model could not predict the Validation cohort because of the unseen and missing features.

Code:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=1, n_jobs=8, n_estimators=100)
clf.fit(X_train_discovery, y_train_discovery)
# predict() takes only the feature matrix, not the labels
clf.predict(X_validation)

Then the following error occurred:

~/.local/lib/python3.10/site-packages/sklearn/base.py:493: FutureWarning: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- 02208210cc2c91fe29bd8be00dd7b9ba
- 02de9d9e2b4480e19f20b3031e5c4411
- 02fc4eaa86303ab33f7a1604f8582c1d
- 0732896b40cf52d4964340bc916324e2
- 07e31a0e839c693bbcbeccf21917199d
- ...
Feature names seen at fit time, yet now missing:
- 04bad4ee94d751778552d22032a3b365
- 0cc84c55622db9f60e77c05b8df2e37a
- 0db760974e0df82b58e34de1383602b5
- 0dc3c64677fa30c458a066cd3f70d17f

So I wonder how "qiime sample-classifier predict-classification" solves this problem of mismatched ASV features.

Is it just a matter of different sklearn versions? Or is there something I missed?

Thanks in advance for your help.

lizgehret · June 28, 2022, 5:44pm

Hi @hjsung13,

Thanks for your patience here! We are checking in with the developer who created the feature extraction method, and will circle back once we hear back from them. Hang tight!

lizgehret · July 11, 2022, 9:36pm

Hi @hjsung13,

Thanks so much for your patience here! I received some insight on this from one of our other moderators that should be helpful:

q2-sample-classifier uses DictVectorizer for feature extraction as part of a model-fitting pipeline. The vectorizer maps feature IDs onto the fixed set of columns learned during fitting, so missing and unseen features are handled before the classification step: with DictVectorizer, a feature that was seen during fitting but is missing from a new sample is filled in as zero, and a feature that was unseen during fitting is ignored. Raw tables are not passed directly to the classifier for fitting or for prediction. See the docs and examples here:
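
To illustrate the idea, here is a minimal sketch using plain scikit-learn. It is not the q2-sample-classifier source: the step names ("dv", "est"), the toy feature IDs, and the labels are all made up for the example. Samples are passed as dicts of feature ID to abundance, and DictVectorizer turns them into a matrix whose columns are fixed at fit time, so at prediction time an unseen feature ID is dropped and a feature absent from the sample becomes 0.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): each sample is a dict of {feature ID: abundance}.
train_samples = [
    {"asv_a": 10.0, "asv_b": 3.0},
    {"asv_a": 8.0, "asv_b": 1.0},
    {"asv_b": 9.0, "asv_c": 4.0},
    {"asv_b": 7.0, "asv_c": 6.0},
]
train_labels = ["healthy", "healthy", "disease", "disease"]

# The vectorizer learns the feature columns at fit time; the forest only
# ever sees the resulting fixed-width matrix, never the raw table.
pipeline = Pipeline([
    ("dv", DictVectorizer(sparse=False)),
    ("est", RandomForestClassifier(n_estimators=100, random_state=1)),
])
pipeline.fit(train_samples, train_labels)

# New sample: "asv_new" was never seen during fit, and "asv_b" and "asv_c"
# (both seen during fit) are missing from it.
new_samples = [{"asv_a": 9.0, "asv_new": 5.0}]

dv = pipeline.named_steps["dv"]
feature_names = list(dv.feature_names_)   # columns learned at fit time
vectorized = dv.transform(new_samples)    # unseen key dropped, missing keys -> 0
predictions = pipeline.predict(new_samples)

print(feature_names)
print(vectorized)
print(predictions)
```

If you are working directly with pandas DataFrames instead, something like X_validation.reindex(columns=X_train_discovery.columns, fill_value=0) before calling clf.predict should give a comparable alignment, assuming the columns of both frames are the feature IDs.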

I hope this helps! Cheers
