Hello QIIME2 team!
I am training a Random Forest classifier for predicting body sites. I have 8 training classes (8 body sites), a total of 145 samples, and 3619 OTUs across all samples. I am using a 75/25 train/test split with recursive feature elimination (RFE) and obtain an overall accuracy of 0.83; RFE selects 158 features from the 108 training samples. This seems to work well.
Now, the next step would be to predict the body sites for a separate set of 37 samples. I was advised by a colleague that if I am happy with the performance of the classifier, I could use the entire set of 145 samples to train another classifier for predicting the separate set. However, when I do this, RFE picks 3259 OTUs. I would be very grateful for any insights on this steep increase in the number of features: RFE picks 158 features when trained on 108 samples, but 3259 features when all 145 samples are included.
In addition, I was wondering if it is okay to train the second classifier (on all 145 samples) using only the set of 158 features?
Thank you so very much!
Hi @meghna_swayambhu ,
You could look at the RFE curve to see where it levels off — similar to a rarefaction curve. It is quite possible that the automated feature selection is selecting 3259 OTUs as the best because that maximizes accuracy, but a much lower number may be "good enough" (so again, similar to selecting a sampling depth based on a rarefaction curve). There is also going to be a certain amount of stochasticity, varying based on which samples are put in the training set and whether you set a random seed. So I would not be too concerned with this observation — look at the RFE curve and importance scores and tinker a little bit to see if you can eke out a similar accuracy with a lower feature count (as I understand that minimizing feature count is an important goal in your experiment, not only maximizing accuracy).
Good luck!
Hi Nicholas,
Thank you so much for your response! Yes, minimizing the features is an important aspect of my experiment. This approach sounds great. I looked at the RFE curve and the .tsv file and it seems like 739 features provide an accuracy of 0.855 (compared to the 0.862 for 3259 features). I am wondering what may be the best way to extract these feature IDs. Are these perhaps the top 739 in the feature importance file?
Thank you very much!
Yes, the feature importances should correspond to the RFE results. This is how RFE (actually RFECV) decides which features to use: at each step, it throws out an increasing number of low-importance features based on the initial ranking. You can read more here:
So yes, you should be able to just take the feature importance scores from your model, select the top N most important features, and re-train your model.
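For what it's worth, that top-N selection is a one-liner in pandas. A minimal sketch, assuming an importance table shaped like the .tsv the classifier exports (the "importance" column name and OTU IDs here are hypothetical):

```python
import pandas as pd

# Hypothetical feature-importance table; in practice you would read the
# exported .tsv with pd.read_csv(path, sep="\t", index_col=0).
importances = pd.DataFrame(
    {"importance": [0.30, 0.25, 0.20, 0.15, 0.10]},
    index=["OTU_a", "OTU_b", "OTU_c", "OTU_d", "OTU_e"],
)

n = 3  # e.g. 739 in the real data
top_ids = importances["importance"].nlargest(n).index.tolist()
print(top_ids)  # feature IDs to keep before re-training
```

The resulting ID list can then be used to filter the feature table down to those features (for example via `qiime feature-table filter-features` with the IDs as a metadata file) before training the second classifier.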
Hi Nicholas,
Thank you for your help and for sharing the link!
Meghna