Hi,
I'm currently working with the q2-sample-classifier plugin (QIIME 2 2021.11, Python 3.8.12) to classify samples by their ASV features.
I trained a sample estimator (RandomForestClassifier) on the Discovery cohort, and used the resulting sample_estimator.qza to predict the disease status of the Validation cohort.
Note: the ASV features in the Discovery cohort and the Validation cohort are quite different, both in number and in identity.
The commands below ran without errors in the QIIME 2 environment.
$ qiime sample-classifier classify-samples \
    --i-table Discovery_cohort_ASV_table.qza \
    --m-metadata-file Metadata.tsv \
    --m-metadata-column Disease \
    --p-test-size 0.3 \
    --p-cv 10 \
    --p-random-state 1 \
    --p-n-estimators 100 \
    --p-parameter-tuning \
    --p-optimize-feature-selection \
    --output-dir Randomforest_result \
    --p-n-jobs 8
$ qiime sample-classifier predict-classification \
    --i-table Validation_cohort_ASV_table.qza \
    --i-sample-estimator sample_estimator.qza \
    --p-n-jobs 8 \
    --output-dir validation_result
However, as far as I know, sklearn's RandomForestClassifier cannot predict on an ASV table that contains features unseen during training or that lacks features present during training.
In fact, when I trained the estimator directly with sklearn (version 1.1.1) in Python (version 3.10.4), the trained model could not predict the Validation cohort because of the unseen and missing features.
Code:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=1, n_jobs=8, n_estimators=100)
clf.fit(X_train_discovery, y_train_discovery)
# predict the Validation cohort (predict() takes only the feature table)
clf.predict(X_validation)
This raised the following error:
~/.local/lib/python3.10/site-packages/sklearn/base.py:493: FutureWarning: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- 02208210cc2c91fe29bd8be00dd7b9ba
- 02de9d9e2b4480e19f20b3031e5c4411
- 02fc4eaa86303ab33f7a1604f8582c1d
- 0732896b40cf52d4964340bc916324e2
- 07e31a0e839c693bbcbeccf21917199d
- ...
Feature names seen at fit time, yet now missing:
- 04bad4ee94d751778552d22032a3b365
- 0cc84c55622db9f60e77c05b8df2e37a
- 0db760974e0df82b58e34de1383602b5
- 0dc3c64677fa30c458a066cd3f70d17f
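For context, one manual workaround in plain sklearn would be to align the validation table to the training columns before predicting: drop the unseen features and add the missing ones as zero counts. The following is only a minimal sketch with made-up toy data. The feature names f1 to f4, the sample IDs, and the helper align_features are hypothetical, and this is not necessarily what QIIME 2 does internally.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def align_features(table, train_columns):
    """Hypothetical helper: drop unseen columns, add missing ones as zeros."""
    return table.reindex(columns=train_columns, fill_value=0)


# Toy stand-ins for the Discovery and Validation ASV tables (samples x features).
X_train = pd.DataFrame(
    {"f1": [5, 0, 3, 0], "f2": [0, 4, 0, 6], "f3": [1, 1, 2, 0]},
    index=["d1", "d2", "d3", "d4"],
)
y_train = ["case", "control", "case", "control"]

# "f3" is missing here and "f4" was never seen during training.
X_validation = pd.DataFrame(
    {"f1": [4, 0], "f2": [0, 5], "f4": [7, 2]},
    index=["v1", "v2"],
)

clf = RandomForestClassifier(random_state=1, n_estimators=100)
clf.fit(X_train, y_train)

# After alignment the columns match the training table exactly, in order.
X_aligned = align_features(X_validation, X_train.columns)
predictions = clf.predict(X_aligned)
print(list(X_aligned.columns))
print(len(predictions))
```

Filling missing ASVs with zero treats them as absent from the sample, which seems reasonable for count tables but is an assumption of this sketch.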
So I wonder: how does "qiime sample-classifier predict-classification" handle the mismatched ASV features?
Is it just a matter of different sklearn versions, or is there something I missed?
Thanks in advance for your help.