Random Forest model with environmental longitudinal data

nkhadland · October 30, 2024, 9:48pm

Hi all,

I have 16S data from environmental samples (rocks) collected at irregular intervals as a fixed-site time series. We have multiple sites; some became inaccessible over the course of the study, and at others we started sampling at a later time.

I am interested in training a random forest model to predict the age of a sample using the regress-samples action of the qiime sample-classifier plugin.

I was wondering if anyone could provide insight on the feasibility of using a random forest model on longitudinal data like this? Based on the reading I have done (e.g., see this paper), I believe fixed-site longitudinal data violates the assumptions of a random forest model since it is correlated data.

I have of course used linear mixed effects models on my data (as well as other suggested longitudinal analyses in this tutorial), but I thought it would be very interesting to be able to predict the age based on all of the features.

Perhaps a way to do this would be to perform site-level bootstrapping? Although I do not think qiime can do this at the moment.
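To make the idea concrete, here is a minimal sketch of what site-level (cluster) bootstrapping could look like outside of qiime, in plain Python. Everything in it is an assumption for illustration: the function name `site_level_bootstrap`, the `sample_sites` mapping, and the sample IDs are hypothetical, and this is not something q2-sample-classifier provides. Whole sites are drawn with replacement, so all samples from a site land together either in the bag or out of it.

```python
import random
from collections import defaultdict


def site_level_bootstrap(sample_sites, n_replicates=100, seed=0):
    """Yield (in_bag, out_of_bag) lists of sample IDs, resampling whole sites.

    sample_sites: dict mapping sample ID -> site label (hypothetical input,
    e.g. built from a metadata column).
    """
    rng = random.Random(seed)

    # Group sample IDs by the site they were collected from.
    by_site = defaultdict(list)
    for sample_id, site in sample_sites.items():
        by_site[site].append(sample_id)
    sites = sorted(by_site)

    for _ in range(n_replicates):
        # Draw as many sites as exist, with replacement.
        drawn = [rng.choice(sites) for _ in sites]
        drawn_set = set(drawn)
        # A site drawn twice contributes its samples twice.
        in_bag = [s for site in drawn for s in by_site[site]]
        # Sites never drawn form the out-of-bag evaluation set.
        out_of_bag = [s for site in sites if site not in drawn_set
                      for s in by_site[site]]
        yield in_bag, out_of_bag


# Toy metadata: three sites with unequal numbers of timepoints.
sample_sites = {
    "s1": "site_A", "s2": "site_A", "s3": "site_A",
    "s4": "site_B", "s5": "site_B",
    "s6": "site_C",
}
replicates = list(site_level_bootstrap(sample_sites, n_replicates=20, seed=1))
```

Each replicate could then be used to train a regressor on the in-bag samples and score it on the out-of-bag sites, so no site contributes to both sides of the split.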

Any suggestions on how to proceed would be greatly appreciated.

Nathan

jphagen · November 1, 2024, 5:28pm

Hi @nkhadland,
There is a new plug-in for qiime2 for bootstrapping! q2-boots
I hope this is helpful to you!
--Hannah

nkhadland · November 2, 2024, 1:59am

Hi Hannah,

Thanks for pointing to this. If I understand correctly, random forest models like the one in qiime sample-classifier regress-samples have bootstrapping built in by definition. I was more curious whether metadata-defined (e.g., site-level) bootstrapping is possible in qiime, and whether that would solve the correlated-data problem associated with fixed-site longitudinal data (or whether that is a problem I should be concerned with at all). It doesn't appear that the new plugin you suggested works with sample-classifier.

I’m open to suggestions for alternative routes. Perhaps this is a long shot, but does anyone have experience in predictive sample classification with longitudinal data, such as mixed effects regression tree/forest (MERT) models?

Thanks

Nathan

Nicholas_Bokulich · November 3, 2024, 7:05am

Hi @nkhadland ,
If I understand correctly (I only skimmed the paper, so I may have missed the point), the concerns raised in that paper about using RF models with longitudinal data are mainly about forecasting outcome targets, e.g., training a model on longitudinally sampled data to predict whether a patient will develop disease, or whether a soil sample is polluted. The reason is that the same site/patient is sampled repeatedly and will share many of the same features over time, while the target value (e.g., outcome, or whether a site is polluted) is constant across all samples from that site. If samples from one site end up in both the training and test sets, this autocorrelation leaks information. That is why different stratification procedures are proposed, e.g., training on some sites and testing on a hold-out set of sites not used in training.

However, in your case you do not want to predict an outcome or target that depends on site (e.g., an outcome that is fixed for each site, or which site a sample came from); you want to predict timepoint, which is independent of site.

There is quite some precedent for predicting timepoint with RF models, e.g., for predicting age as done in this paper: "Persistent gut microbiota immaturity in malnourished Bangladeshi children" (Nature).

So as long as you do not have, e.g., duplicate samples from a site that could leak information about the temporal signature at that site, it should be okay to try predicting timepoint in your data with an RF regressor.

You could also try cross-validating across sites. This would be possible with q2-sample-classifier plus a bit of Python (or technically with q2-feature-table, but that would be a little more awkward): create the training/test split manually by holding out one site at a time, train your RF regressor directly on the remaining sites, then predict values for the hold-out site. This is basically the approach described in that paper, and it should be robust as long as the sites are not correlated with each other (e.g., through spatial autocorrelation). It might not be necessary to take such a complicated approach, though, given that you want to predict time, not a site-dependent target.
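As a rough illustration of the hold-one-site-out idea, here is a minimal sketch that uses scikit-learn directly rather than the q2-sample-classifier interface. It assumes the feature table has already been exported to a samples-by-features array; the function name `leave_one_site_out`, the site labels, and the synthetic data are all made up for the example and stand in for real 16S data and metadata.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut


def leave_one_site_out(X, y, sites, n_estimators=100, seed=0):
    """Predict each sample's age with a model trained on all other sites."""
    X = np.asarray(X)
    y = np.asarray(y, dtype=float)
    sites = np.asarray(sites)
    predictions = np.empty_like(y)

    # Each split holds out every sample from exactly one site.
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=sites):
        model = RandomForestRegressor(n_estimators=n_estimators,
                                      random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        predictions[test_idx] = model.predict(X[test_idx])
    return predictions


# Synthetic stand-in data (assumption): 4 sites x 10 samples, 5 features,
# where only the first feature carries a noisy age signal.
rng = np.random.default_rng(0)
sites = np.repeat(["site_A", "site_B", "site_C", "site_D"], 10)
age = rng.uniform(0, 100, size=sites.size)
X = np.column_stack([
    age + rng.normal(0, 5, size=sites.size),
    rng.normal(size=(sites.size, 4)),
])

predicted_age = leave_one_site_out(X, age, sites, n_estimators=50)
```

Comparing `predicted_age` against the true ages (e.g., with mean absolute error, overall and per site) then gives an estimate of how well the model transfers to a site it has never seen.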

nkhadland · November 4, 2024, 9:07pm

Hi @Nicholas_Bokulich,

Thanks so much for this context, this is really helpful!

Nathan
