--p-optimize-feature-selection/--p-no-optimize-feature-selection outputs are not inclusive?

Decen · July 10, 2019, 12:48pm

While following the QIIME 2 tutorial with the "Moving Pictures" data, I ran into a problem that confused me.
In the q2-sample-classifier plugin, when I use the parameter --p-optimize-feature-selection, the number of features in the "feature_importance.qza" output is larger than when I use --p-no-optimize-feature-selection. That much I can understand. But when I copied the features out of the two importance.tsv files and compared them, I found that most of the features in the two .tsv files overlap.

Anyway, as I understand it, the features selected with "--p-optimize-feature-selection" should include all of the features selected with "--p-no-optimize-feature-selection", so this looks strange to me. However, I think "--p-optimize-feature-selection" is better for the accuracy of the model.

My other question is about the parameter "--p-n-estimators". Based on your explanation, accuracy should go up as the number increases, for instance from "--p-n-estimators 20" to "--p-n-estimators 100", but I found that the accuracy decreased instead.

Nicholas_Bokulich · July 11, 2019, 12:12pm

Yes, in theory the optimized features should be a subset of the non-optimized features, since the non-optimized model should use all features. Some of your features may be left out if you are not using enough trees, since each tree only uses a subset of features.
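
To check the subset relationship directly instead of comparing the lists by eye, something like the following rough sketch may help. It assumes each exported .tsv has a header on the first line and the feature ID in the first column; the function names and the toy data are made up for illustration and are not part of q2-sample-classifier.

```python
import csv
import io


def read_feature_ids(tsv_text):
    """Return the set of feature IDs found in the first column of a TSV string."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    ids = set()
    for i, row in enumerate(rows):
        # Assumption: line 0 is the header; also skip blank and '#' comment lines.
        if i == 0 or not row or row[0].startswith("#"):
            continue
        ids.add(row[0])
    return ids


def compare_feature_sets(optimized_tsv, unoptimized_tsv):
    """Split the feature IDs into shared and run-specific groups."""
    optimized = read_feature_ids(optimized_tsv)
    unoptimized = read_feature_ids(unoptimized_tsv)
    return {
        "shared": optimized & unoptimized,
        "only_optimized": optimized - unoptimized,
        "only_unoptimized": unoptimized - optimized,
    }


# Toy stand-ins for the contents of the two exported files (hypothetical data).
optimized_tsv = "id\timportance\nfeatA\t0.6\nfeatB\t0.4\n"
unoptimized_tsv = "id\timportance\nfeatA\t0.5\nfeatB\t0.3\nfeatC\t0.2\n"

result = compare_feature_sets(optimized_tsv, unoptimized_tsv)
# An empty list here means the optimized features are a true subset.
print(sorted(result["only_optimized"]))
```

With real data you would read each file's text with `open(path).read()` and pass it in. Anything listed under `only_optimized` is a feature that the non-optimized run did not report.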

Generally, more trees result in higher accuracy, but nothing is certain. If you are getting lower accuracy with more trees, it is just stochastic: each run randomly selects a different set of training and test samples, and that alone can have a large impact on accuracy, especially if you have a small number of samples.
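
To give a feel for how much the test split alone can move the reported accuracy, here is a small simulation sketch. It is a simplification under stated assumptions: the classifier is treated as having a fixed "true" accuracy, each test sample is an independent coin flip, and the values 0.8, 7 and 700 are hypothetical. It ignores the extra variation that comes from training on different samples.

```python
import math
import random


def accuracy_standard_error(true_accuracy, n_test):
    """Binomial standard error of an accuracy estimated from n_test samples."""
    return math.sqrt(true_accuracy * (1.0 - true_accuracy) / n_test)


def simulate_observed_accuracies(true_accuracy, n_test, n_splits, seed=0):
    """Simulate the accuracy observed on several random test sets."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    accuracies = []
    for _ in range(n_splits):
        # Each test sample is classified correctly with probability true_accuracy.
        correct = sum(rng.random() < true_accuracy for _ in range(n_test))
        accuracies.append(correct / n_test)
    return accuracies


# Hypothetical numbers: a model that is right 80% of the time, 7 test samples.
small = simulate_observed_accuracies(0.8, n_test=7, n_splits=5)
print(small)  # observed accuracies on five simulated test sets

print(round(accuracy_standard_error(0.8, 7), 3))    # 0.151
print(round(accuracy_standard_error(0.8, 700), 3))  # 0.015
```

With only a handful of test samples, a swing of 10 to 15 percentage points between runs is within ordinary noise under these assumptions, so a drop in accuracy when going from 20 to 100 trees does not by itself show that more trees hurt.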