How does QIIME 2 handle unseen or missing features when predicting new samples with a pre-trained RF model?


I'm currently working with the q2-sample-classifier plugin (QIIME 2 version 2021.11, Python 3.8.12) to classify samples based on their ASV features.

I trained a sample estimator (RandomForestClassifier) on the Discovery cohort and used the resulting sample_estimator.qza to predict the disease status of the Validation cohort.

Note: the number of ASV features differs considerably between the Discovery cohort and the Validation cohort.

The commands below ran without problems in the QIIME 2 environment.

$ qiime sample-classifier classify-samples \
    --i-table Discovery_cohort_ASV_table.qza \
    --m-metadata-file Metadata.tsv \
    --m-metadata-column Disease \
    --p-test-size 0.3 \
    --p-cv 10 \
    --p-random-state 1 \
    --p-n-estimators 100 \
    --output-dir Randomforest_result \
    --p-n-jobs 8
$ qiime sample-classifier predict-classification \
    --i-table Validation_cohort_ASV_table.qza \
    --i-sample-estimator sample_estimator.qza \
    --p-n-jobs 8 \
    --output-dir validation_result

However, as far as I know, sklearn's RandomForestClassifier cannot predict on an ASV table that contains unseen features or lacks features that were present during training.

In fact, when I trained the estimator directly with sklearn (version 1.1.1) in Python (version 3.10.4), the trained model could not predict the Validation cohort because of the unseen and missing features.


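from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=1, n_jobs=8, n_estimators=100)
clf.fit(X_train_discovery, y_train_discovery)  # fit on the Discovery cohort ASV table
clf.predict(X_validation)  # predict() takes only the feature table, not the labels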

Then the following error occurred:

~/.local/lib/python3.10/site-packages/sklearn/ FutureWarning: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- 02208210cc2c91fe29bd8be00dd7b9ba
- 02de9d9e2b4480e19f20b3031e5c4411
- 02fc4eaa86303ab33f7a1604f8582c1d
- 0732896b40cf52d4964340bc916324e2
- 07e31a0e839c693bbcbeccf21917199d
- ...
Feature names seen at fit time, yet now missing:
- 04bad4ee94d751778552d22032a3b365
- 0cc84c55622db9f60e77c05b8df2e37a
- 0db760974e0df82b58e34de1383602b5
- 0dc3c64677fa30c458a066cd3f70d17f

So I wonder: how does "qiime sample-classifier predict-classification" handle the mismatched ASV features?
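
One mechanism that would explain the behaviour is a pipeline that starts with sklearn's `DictVectorizer`, which silently ignores features not seen during fit and leaves absent features at zero. I have not verified that q2-sample-classifier works this way, so the sketch below is only an assumption about a possible design, with made-up ASV names and counts.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

# Each sample is a dict of {ASV ID: count}; absent ASVs are simply omitted.
train_samples = [
    {"asv_a": 5, "asv_c": 2},
    {"asv_b": 3, "asv_c": 1},
    {"asv_a": 4, "asv_b": 1},
    {"asv_b": 2, "asv_c": 3},
]
y_train = ["case", "control", "case", "control"]

# Validation samples lack "asv_c" and contain the unseen "asv_d".
val_samples = [
    {"asv_a": 6, "asv_d": 9},
    {"asv_b": 4, "asv_d": 1},
]

pipeline = Pipeline([
    ("dv", DictVectorizer(sparse=False)),
    ("est", RandomForestClassifier(n_estimators=100, random_state=1)),
])
pipeline.fit(train_samples, y_train)

# Prediction works even though the feature sets differ.
pred = pipeline.predict(val_samples)

# Inspect what the classifier actually sees for the validation samples.
val_matrix = pipeline.named_steps["dv"].transform(val_samples)
```

If something like this is happening, the unseen ASVs in the Validation cohort would simply not contribute to the prediction, which I would like to have confirmed.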

Is it just a matter of different sklearn versions, or is there something I missed?

Thanks in advance for your help.