RFE scores and accuracy do not match

I am running q2-sample-classifier to train a random forest classifier with RFE. My input file is actually a metabolomics abundance table, so the features are metabolites. When looking at the accuracy results, the overall accuracy is 0.97 with 532 selected features. However, when I look at the RFE scores, the accuracy for the 532 features is 0.98. I was wondering why these two accuracies differ for the same number of features? Any insight would be greatly appreciated! Thank you very much!

importance.tsv (20.0 KB)
predictive_accuracy.tsv (582 Bytes)
rfe_scores.tsv (544 Bytes)

Hey @meghna_swayambhu,

(@Nicholas_Bokulich, please correct me if I'm wrong)

I think this is because the RFE scores in the model summary do not consider the hold-out data; they only reflect how the model performs on the training data. There wouldn't be much point in using the hold-out/test data to guide feature selection, since you'd just overfit to your test data.

Given that your accuracy on the hold-out/test data is 0.97 (vs. 0.98 on the training data itself), it seems like classification went extremely well.
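To make the distinction concrete, here is a minimal, stdlib-only Python sketch. It is not the q2-sample-classifier implementation: a nearest-centroid classifier stands in for the random forest, the data are synthetic, and every name in it (`rfe`, `cv_accuracy`, `make_sample`, and so on) is hypothetical. It only illustrates the idea that the RFE score for a given feature count is computed inside the training set, while the reported accuracy comes from samples the selection never saw, so the two numbers need not agree.

```python
import random

def centroids(X, y, feats):
    """Per-class mean of each selected feature."""
    out = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        out[label] = [sum(r[f] for r in rows) / len(rows) for f in feats]
    return out

def accuracy(X_fit, y_fit, X_eval, y_eval, feats):
    """Fit a nearest-centroid classifier on one set, score it on another."""
    cents = centroids(X_fit, y_fit, feats)
    hits = 0
    for x, lab in zip(X_eval, y_eval):
        pred = min(cents, key=lambda c: sum((x[f] - m) ** 2
                                            for f, m in zip(feats, cents[c])))
        hits += pred == lab
    return hits / len(y_eval)

def cv_accuracy(X, y, feats, k=5):
    """Mean k-fold accuracy, computed inside the training set only."""
    scores = []
    for fold in range(k):
        held = [i for i in range(len(y)) if i % k == fold]
        fit = [i for i in range(len(y)) if i % k != fold]
        scores.append(accuracy([X[i] for i in fit], [y[i] for i in fit],
                               [X[i] for i in held], [y[i] for i in held],
                               feats))
    return sum(scores) / k

def rfe(X, y, n_features):
    """Drop the weakest feature one at a time, scoring each subset by CV."""
    feats = list(range(n_features))
    scores, subsets = {}, {}
    while feats:
        scores[len(feats)] = cv_accuracy(X, y, feats)
        subsets[len(feats)] = list(feats)
        # Simplification: rank features on the whole training set, using the
        # gap between the two class centroids as a stand-in for importance.
        cents = centroids(X, y, feats)
        weakest = min(range(len(feats)),
                      key=lambda j: abs(cents[0][j] - cents[1][j]))
        feats.pop(weakest)
    return scores, subsets

random.seed(0)
N_FEATURES = 10  # assumption: only the first 3 features carry signal

def make_sample(label):
    return [random.gauss(1.5 * label if f < 3 else 0.0, 1.0)
            for f in range(N_FEATURES)]

y = [i % 2 for i in range(120)]
X = [make_sample(lab) for lab in y]
X_train, y_train = X[:80], y[:80]   # used for feature selection and fitting
X_test, y_test = X[80:], y[80:]     # hold-out data, never seen by RFE

rfe_scores, subsets = rfe(X_train, y_train, N_FEATURES)
best_n = max(rfe_scores, key=rfe_scores.get)
train_side = rfe_scores[best_n]
test_side = accuracy(X_train, y_train, X_test, y_test, subsets[best_n])

print(f"features kept: {best_n}")
print(f"RFE score (training data only): {train_side:.3f}")
print(f"hold-out accuracy:              {test_side:.3f}")
```

The two printed numbers are estimates from different samples, so a small gap like 0.98 vs. 0.97 is expected rather than a sign that something went wrong.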

Hello,
Ah yes, you are absolutely right! The RFE scores are reported on the training data, and the accuracy is reported on the test data. Thank you very much.

Meghna