I'm currently working with the q2-sample-classifier plugin (qiime2 2021.11, Python 3.8.12) to classify samples by their ASV features.
I trained a sample estimator (RandomForestClassifier) on the Discovery cohort and used the resulting sample_estimator.qza to predict the disease status of the Validation cohort.
Note: the number of ASV features differs considerably between the Discovery cohort and the Validation cohort.
The commands below worked well in the qiime2 environment.
$ qiime sample-classifier classify-samples --i-table Discovery_cohort_ASV_table.qza --m-metadata-file Metadata.tsv --m-metadata-column Disease --p-test-size 0.3 --p-cv 10 --p-random-state 1 --p-n-estimators 100 --p-parameter-tuning --p-optimize-feature-selection --output-dir Randomforest_result --p-n-jobs 8
$ qiime sample-classifier predict-classification --i-table Validation_cohort_ASV_table.qza --i-sample-estimator sample_estimator.qza --p-n-jobs 8 --output-dir validation_result
However, as far as I know, sklearn's RandomForestClassifier cannot predict on an ASV table that contains features unseen during training or lacks features that were seen during training.
In fact, when I trained the estimator directly with sklearn (version 1.1.1) in Python (version 3.10.4), the trained model could not predict the Validation cohort because of the unseen and missing features.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=1, n_jobs=8, n_estimators=100)
clf.fit(X_train_discovery, y_train_discovery)
# predict() takes only the feature table, not the labels
clf.predict(X_validation)
This failed with the following output:
~/.local/lib/python3.10/site-packages/sklearn/base.py:493: FutureWarning: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- 02208210cc2c91fe29bd8be00dd7b9ba
- 02de9d9e2b4480e19f20b3031e5c4411
- 02fc4eaa86303ab33f7a1604f8582c1d
- 0732896b40cf52d4964340bc916324e2
- 07e31a0e839c693bbcbeccf21917199d
- ...
Feature names seen at fit time, yet now missing:
- 04bad4ee94d751778552d22032a3b365
- 0cc84c55622db9f60e77c05b8df2e37a
- 0db760974e0df82b58e34de1383602b5
- 0dc3c64677fa30c458a066cd3f70d17f
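For reference, one workaround on the plain sklearn side would be to align the validation table's columns to the training columns before predicting. Below is a minimal sketch with toy data. The ASV IDs, sample IDs, and labels are made up, and filling ASVs that are absent from the validation table with 0 is an assumption (that they were simply not observed), not something I have confirmed is appropriate here.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the real ASV tables (samples x ASVs); all IDs are made up.
X_train = pd.DataFrame(
    [[5, 0, 2], [0, 3, 1], [4, 1, 0], [0, 2, 3]],
    index=["d1", "d2", "d3", "d4"],
    columns=["asv_a", "asv_b", "asv_c"],
)
y_train = ["healthy", "disease", "healthy", "disease"]

# The validation table lacks asv_b and contains the unseen asv_x.
X_validation = pd.DataFrame(
    [[3, 1, 7], [0, 2, 0]],
    index=["v1", "v2"],
    columns=["asv_a", "asv_c", "asv_x"],
)

clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X_train, y_train)

# Keep only the ASVs seen at fit time, in the same order. ASVs absent from
# the validation table are filled with 0 (assumption: "not observed").
X_aligned = X_validation.reindex(columns=X_train.columns, fill_value=0)

predictions = clf.predict(X_aligned)
```

With the columns aligned this way, predict() receives exactly the feature names it saw during fit, so the feature-name check should no longer complain. Note that unseen ASVs such as asv_x are dropped and contribute nothing to the prediction.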
So I wonder: how does "qiime sample-classifier predict-classification" handle the mismatched ASV features?
Is it just a matter of a different sklearn version, or is there something I missed?
Thanks in advance for your help.