Multiple regression?

In my project, I sampled 7 different body sites after death over a period of 7 months. I now want to know whether the microbiome can be used to predict time of death. To do that, I ran a random forest regression with q2-sample-classifier for each site. The problem is that, since I only have 15-20 samples per body site, even with --p-test-size set to 50% I end up with very few test results, and those are not very good.

I was wondering if there is a way to do a multiple regression with QIIME 2. That way I could also combine the sites and see whether a combination of sites predicts better than a single site and, if so, which combination is best. It would also let me test multiple factors.

Thank you very much!


Hello Audrey-Anne,

I'm not sure how well this will work on your data set, but you may find this plugin helpful. Some functions do support multiple regressions and mixed effects.

Time-series data is tricky! :thinking: Let us know what you find!

:skull: :fast_forward: :timer_clock:

Hi Colin,

I did use this function with my data: I ran an LME and a volatility analysis. It was great for understanding what happened to the microbiome during decomposition, more as fundamental research. Unfortunately, I could not use it to predict a post-mortem interval. I need a function that would analyze the different body sites and come up with a model that tells me which combination of sites gives the most accurate prediction. In the literature, some use random forest regression, but it does not seem to work for me, which is why I thought a multiple regression might be more helpful.

But yeah, I do realize that time-series data is tricky, but I love a challenge :stuck_out_tongue:

Thank you so much!


This sounds like feature extraction from a model, and I don't know how to do that from an LME. Fitting a random forest and then using leave-one-out to identify important features is one method.
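To make that concrete, here is a rough sketch in plain scikit-learn (not a QIIME 2 command). Everything in it is an assumption for illustration: the data are synthetic, the sample and feature counts are made up, and taxon 0 is deliberately built to track time. Note that plain leave-one-out treats every sample as independent, which repeated sampling of the same body does not guarantee.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut

# Synthetic stand-in data (hypothetical): 18 samples x 5 taxa for one body site.
rng = np.random.default_rng(0)
n_samples, n_features = 18, 5
X = rng.random((n_samples, n_features))
days = np.linspace(0, 210, n_samples)  # days since death, roughly 7 months
X[:, 0] = days / 210 + rng.normal(0, 0.05, n_samples)  # taxon 0 tracks time

predictions = np.zeros(n_samples)
importances = np.zeros((n_samples, n_features))

# Each fold holds out one sample, trains on the rest, and predicts the held-out one.
for fold, (train_idx, test_idx) in enumerate(LeaveOneOut().split(X)):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[train_idx], days[train_idx])
    predictions[test_idx] = model.predict(X[test_idx])
    importances[fold] = model.feature_importances_

# Held-out error in days, and importances averaged over all folds.
mean_abs_error = float(np.mean(np.abs(predictions - days)))
mean_importance = importances.mean(axis=0)
top_feature = int(np.argmax(mean_importance))
print(f"LOO mean absolute error: {mean_abs_error:.1f} days")
print(f"Most important feature index: {top_feature}")
```

Features that stay near the top of `mean_importance` across folds are the candidates worth a closer look.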

Time-series data has one-way causality, so I've seen people use causal inference on data like this. I wonder if @jwdebelius has done this before.

Hi @colinbrislawn and @Audrey_Anne,

Kind of, although not in this way. Like, I've done dynamics predicting outcome, and I've done dynamics as an outcome, but I've never done repeated measures predicting time.

I think I'm also struggling with how the model gets integrated. It seems like you have a relatively small sample size. Even if we assume body sites are mostly independent (noting that :mouse2: are coprophagic assholes, not everyone washes their hands well, and oral microbe translocation seems to be a mark of poor health), there's still an issue of repeated measures through time. I don't know if/how a random forest model would address this.

I have a broader concern (cue eye rolling from friends), which is your sample size. I don't know that you can do causal inference on what looks like it might be 2(?) bodies with repeated sampling. I worry about any model being overfit, because it's based on a single person, and I think that's a major factor to consider when you look at modeling.
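One way to probe that overfitting worry is to hold out whole bodies instead of individual samples, so the model is always scored on a body it has never seen. Here is a minimal scikit-learn sketch of that idea. The three bodies, six time points, and four taxa are invented for illustration, and none of this comes from a QIIME 2 plugin.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut

# Synthetic stand-in data (hypothetical): 3 bodies, each sampled at 6 time points.
rng = np.random.default_rng(1)
n_bodies, n_timepoints, n_features = 3, 6, 4
days = np.tile(np.linspace(0, 210, n_timepoints), n_bodies)
body_id = np.repeat(np.arange(n_bodies), n_timepoints)
X = rng.random((n_bodies * n_timepoints, n_features))
X[:, 0] = days / 210 + rng.normal(0, 0.05, days.size)  # taxon 0 tracks time

# Each fold trains on all bodies but one and tests on the held-out body.
errors_by_body = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, days, groups=body_id):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[train_idx], days[train_idx])
    held_out = int(body_id[test_idx][0])
    abs_errors = np.abs(model.predict(X[test_idx]) - days[test_idx])
    errors_by_body[held_out] = float(abs_errors.mean())

for body, err in sorted(errors_by_body.items()):
    print(f"Held-out body {body}: mean absolute error {err:.1f} days")
```

If the error on held-out bodies is much worse than the error from sample-level cross-validation, the model is probably learning the individual rather than the time signal. With only one or two bodies this check cannot really be run, which is the underlying problem.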

I think also deciding on what your assumptions around future classification are matters. (More eye rolling as I get philosophical). So, like, do you think this classifier will help predict time since death when a body is found in a field? Are you planning to follow bodies over time to see if you can predict the change?

If you think your current data supports the model, then I think you do have to account for it. I'm personally moving toward using more continuous derived log ratios in my work (i.e., model the data, construct an ALR, treat the ALR as an independent variable in my model). In theory, you could do this across multiple body sites in a training set, throw them all into some kind of regression for age (although I'm not sure QIIME 2 can do this), and then use the regression model as a classifier/predictor. (Apparently just linear regression is a classifier :woman_shrugging:). I've used code from Jamie Morton to build an ALR on an LME. It's a little bit fragile with Stan, so that's kind of a downside.
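As a very rough sketch of the "ALRs from several sites into one regression" idea, here is a NumPy-only version. It is a simplification, not the Stan-based workflow: the ALR is computed directly from counts with a pseudocount rather than from a fitted model, and the counts, site names, reference taxon, and `alr` helper are all hypothetical.

```python
import numpy as np

def alr(counts, reference_index, pseudocount=1.0):
    """Additive log-ratio: log of each taxon over a chosen reference taxon."""
    counts = np.asarray(counts, dtype=float) + pseudocount
    reference = counts[:, [reference_index]]
    others = np.delete(counts, reference_index, axis=1)
    return np.log(others / reference)

# Hypothetical counts: 6 time points x 3 taxa for each of two body sites.
days = np.array([0, 30, 60, 90, 120, 150], dtype=float)
skin = np.array([[10, 40, 50], [20, 35, 50], [40, 30, 50],
                 [80, 25, 50], [160, 20, 50], [320, 15, 50]])
gut = np.array([[60, 10, 30], [50, 14, 30], [45, 20, 30],
                [38, 27, 30], [30, 40, 30], [25, 55, 30]])

# ALR features per site, using the last taxon as the reference (an assumption).
skin_alr = alr(skin, reference_index=2)
gut_alr = alr(gut, reference_index=2)

# Multiple regression: intercept plus the ALR columns from both sites.
design = np.hstack([np.ones((days.size, 1)), skin_alr, gut_alr])
coef, *_ = np.linalg.lstsq(design, days, rcond=None)
fitted = design @ coef
print("Coefficients:", np.round(coef, 2))
print("Fitted days:", np.round(fitted, 1))
```

With 6 samples and 5 coefficients the fit is nearly saturated, so the in-sample fit says nothing about prediction. Any real use would need held-out bodies to evaluate it.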

I think my best advice is actually to find a statistician and/or machine learning expert to collaborate with. The kind of modeling you need to do here requires some serious expertise, and will also require knowledge/insights about some of the specifics of your project.

Sorry I don't have a tidy solution.