I just read the Roesch et al. 2019 article about their new PIME R package. Stevens et al. 2020 also used this tool to differentiate DEPR and NODEP groups based on the microbiome. Does anyone else find this interesting?
How I understood PIME: the method uses Random Forest models to select a prevalence level for filtering the data within each study group (for example DEPR). To me this looks like it could overfit the dataset for any further analyses. PIME tries to control for this by creating 100 random splits and comparing their OOB error against 100 repeats of the real group splits. As far as I can tell, this only tests whether the method finds patterns that separate completely artificial groupings, and in my opinion it does nothing to combat overfitting. I feel this is similar to first running differential abundance testing on the whole dataset to find significantly different taxa and then using only those taxa in further analyses such as machine learning classification, which leaks information from the test set into the training set.
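To make the leakage concern concrete, here is a minimal sketch in plain Python. It is not PIME's code and does not reproduce its prevalence filtering; the nearest-centroid classifier is a stand-in for Random Forest, and the sample count, feature count, and all function names are made up for illustration. The data are pure noise, so any cross-validated accuracy well above 0.5 can only come from selecting features on the whole dataset before splitting.

```python
import random
import statistics


def top_features(X, y, rows, k):
    """Rank features by absolute difference in class means, using only `rows`."""
    scores = []
    for j in range(len(X[0])):
        a = [X[i][j] for i in rows if y[i] == 0]
        b = [X[i][j] for i in rows if y[i] == 1]
        scores.append((abs(statistics.fmean(a) - statistics.fmean(b)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def nearest_centroid_predict(X, y, train, test, feats):
    """Predict each test row as the class whose training centroid is closest."""
    cents = {}
    for c in (0, 1):
        members = [i for i in train if y[i] == c]
        cents[c] = [statistics.fmean(X[i][j] for i in members) for j in feats]
    preds = []
    for i in test:
        dist = {
            c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, cents[c]))
            for c in (0, 1)
        }
        preds.append(0 if dist[0] <= dist[1] else 1)
    return preds


def cv_accuracy(X, y, k=10, n_folds=5, leaky=False):
    """Cross-validated accuracy; `leaky=True` selects features on ALL rows first."""
    rows = list(range(len(X)))
    global_feats = top_features(X, y, rows, k)  # has seen the test folds
    correct = 0
    for f in range(n_folds):
        test = [i for i in rows if i % n_folds == f]
        train = [i for i in rows if i % n_folds != f]
        feats = global_feats if leaky else top_features(X, y, train, k)
        preds = nearest_centroid_predict(X, y, train, test, feats)
        correct += sum(p == y[i] for p, i in zip(preds, test))
    return correct / len(X)


# Hypothetical noise-only dataset: 40 samples, 500 "taxa", alternating labels.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(500)] for _ in range(40)]
y = [i % 2 for i in range(40)]

leaky = cv_accuracy(X, y, leaky=True)
honest = cv_accuracy(X, y, leaky=False)
print(f"selection on all data:      {leaky:.2f}")
print(f"selection inside each fold: {honest:.2f}")
```

Redoing the selection inside every training fold is what keeps the estimate honest. Any filtering step tuned on the full dataset, a prevalence cutoff included, would in principle need the same treatment.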
Additionally, the method seems to use OOB accuracy as the metric for selecting the prevalence level, and accuracy can be badly misleading on unbalanced datasets, since a model that always predicts the majority class already scores well. I have not yet dived into the source code, but I would imagine some sort of resampling that balances the groups happens before training.
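A quick illustration of the accuracy problem, with made-up group sizes (90 DEPR vs 10 NODEP, not the numbers from either paper) and a "classifier" that always predicts the majority class. The helper functions are written from scratch for this sketch and are not taken from PIME.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each group counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical unbalanced study: 90 DEPR samples, 10 NODEP samples.
y_true = ["DEPR"] * 90 + ["NODEP"] * 10
# A useless model that ignores the data and always says "DEPR".
y_pred = ["DEPR"] * 100

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(acc)  # 0.9
print(bal)  # 0.5
```

Plain accuracy rewards the model for never detecting the minority group, while balanced accuracy puts it at chance level. If the groups are not balanced before training, a metric along these lines seems like a safer selection criterion.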
What does everyone else think about this? Is there potential for a QIIME plugin?