PIME (2019) prevalence-filtering method discussion

Hello everyone,

I just read the Roesch et al. 2019 article about their new PIME R package. Stevens et al. 2020 also used this tool to differentiate DEPR and NODEP groups based on the microbiome. Does anyone else find this interesting?

How I understood PIME: the method uses Random Forest models to select a prevalence level at which to filter the data within a study group (for example DEPR), which seems to carry a real risk of overfitting the dataset for downstream analyses. PIME tries to control for this by doing 100 random group shuffles and comparing their OOB error against 100 repeats with the real group labels. In my opinion this only tests whether the method can find patterns in completely random, artificial groupings; it does nothing to counter the overfitting itself.

It feels similar to first running differential abundance testing on the whole dataset to find significantly different taxa and then using only those taxa in further analyses such as machine learning classification, which lets information leak from the test set into the training set.
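To make the leakage pattern concrete (this is a generic toy sketch, not PIME itself, and the data are simulated), here is what happens when feature selection sees all the labels before cross-validation versus when it is done inside each training fold:

```python
# Toy illustration of supervised feature selection leaking into evaluation.
# NOT PIME's code -- just the generic leakage pattern described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))   # pure-noise "feature table", small n, many features
y = rng.integers(0, 2, size=60)   # labels carry no real signal

rf = RandomForestClassifier(n_estimators=200, random_state=0)

# Leaky: feature selection sees every sample's label before the CV split.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(rf, X_leaky, y, cv=5).mean()

# Proper: selection is refit inside each training fold via a pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=20), rf)
proper_acc = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky_acc:.2f}")   # well above chance despite no signal
print(f"proper CV accuracy: {proper_acc:.2f}")  # near 0.5, the honest estimate
```

Any downstream model evaluated on data that was already filtered with knowledge of the labels will look better than it should, which is my worry with filtering inside the known groups first.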

Additionally, the method seems to use OOB accuracy as the metric for selecting the prevalence level, and accuracy simply does not work as a metric on unbalanced datasets. I have not yet dug into the source code, but I would imagine some sort of resampling that balances the groups happens before training.
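For what it's worth, here is a generic sketch (again simulated data, not PIME's code) of why plain OOB accuracy can look reassuring under a 90/10 class split even when there is nothing to learn, while balanced accuracy sits at chance:

```python
# Generic illustration: OOB accuracy vs. balanced accuracy under class imbalance.
# Pure-noise features with a 90/10 label split -- there is nothing to learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = np.r_[np.zeros(450, dtype=int), np.ones(50, dtype=int)]

rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            random_state=0).fit(X, y)
# Recover the out-of-bag class predictions so they can be scored with another metric.
oob_pred = rf.classes_[rf.oob_decision_function_.argmax(axis=1)]

print(f"OOB accuracy:      {rf.oob_score_:.2f}")                         # ~0.9, looks "good"
print(f"balanced accuracy: {balanced_accuracy_score(y, oob_pred):.2f}")  # ~0.5, i.e. chance
```

If the package does balance the groups before training (the R randomForest package, for example, supports stratified per-tree sampling via its strata/sampsize arguments), the OOB accuracy would be less misleading, but that is exactly what I would want to confirm from the source.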

What does everyone else think about this? Is there potential for a QIIME plugin?

Roesch et al. 2019: https://www.biorxiv.org/content/10.1101/632182v1
Stevens et al. 2020: https://www.nature.com/articles/s41380-020-0652-5


Welcome to the forum, @pvan!

I have not yet read the paper in detail, but I agree with you that both overfitting and imbalanced classes seem like issues with the method (as they always are when careful controls are not in place).

In addition, the way that prevalence filtering is performed on each group independently is unclear and rather suspicious: i.e., if a feature is highly prevalent in group A but less prevalent in group B, is it kept in A and removed from B? As far as I can tell, the OOB error is estimated after prevalence filtering on all samples. If that is the case, the filtering is occurring in a supervised manner and leaking information into the validation step. But maybe the authors, or someone who has taken the time to thoroughly review the paper, can comment.
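To spell out why I call it supervised, here is a hypothetical sketch of one plausible reading of "per-group prevalence filtering" on a samples-by-features table (my reading only, not PIME's actual implementation; the function name and the keep-if-it-passes-in-any-group rule are my assumptions):

```python
# Hypothetical sketch of per-group prevalence filtering on a samples x features
# count table. NOT PIME's code -- just one plausible reading of the idea.
import pandas as pd

def per_group_prevalence_filter(table: pd.DataFrame, groups: pd.Series,
                                min_prevalence: float) -> pd.DataFrame:
    """Keep a feature if its prevalence (fraction of samples where it is present)
    reaches min_prevalence in at least one group. Because the group labels are
    consulted here, any OOB/validation error computed afterwards has already
    seen label information."""
    keep = pd.Series(False, index=table.columns)
    for group in groups.unique():
        samples = groups.index[groups == group]
        prevalence = (table.loc[samples] > 0).mean(axis=0)
        keep |= prevalence >= min_prevalence
    return table.loc[:, keep]
```

Whether a feature that passes in A but not in B is kept globally (as in this sketch) or handled per group is exactly the part I could not tell from the preprint; either way, the labels enter before any validation split exists.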