modeling bacterial features as independent variables

kam · March 23, 2023, 4:12pm

A follow-up question regarding your suggested methods: as if microbiome data were not complicated enough, longitudinal analysis makes matters worse.

jwdebelius:

Predicting outcome/single microbe exposure relationships

So, if sample classifiers won't give you what you need, why not just flip to a classic regression? You could theoretically just do the CLR transform yourself, write a for-loop, and then crank through all those models and come out with an OR, RR, or beta on the other side. (y ~ , x_{0} ~ ). There are lots of libraries that will do an FDR correction. And, on a simple level, you've solved the problem. Yay!

Except that, as long as you don't have a time-to-failure component in your model*, the interpretation of (y ~ , x_{0} ~ ) and (x_{0} ~ , y ~ ) shouldn't be any different if you have a single microbe you're looking at. The coefficients won't be the same, but I'm not sure it matters if cases have 2x compared to controls, or that every time you double , your odds of being a case increase. (I'm not caffeinated enough to do that particular math.) For continuous outcomes, it's pretty much just algebra. (If y=mx+b, then x=\frac{1}{m}(y-b).)

Based on that, my recommendation is to stick with your single direction standard tools that already know your data, and just think about how you frame things in your results/interpretation/discussion.

The one exception might be if you have a list of organisms a priori that you think are keystone or you want to reproduce. You need to think carefully about how you pick those, and how things line up, but cramming them through a x_{0} ~ model might make sense and make your life better. Just, like, FDR correct your data and be clear about why you picked your list.
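
For anyone following along, the FDR step mentioned in the quoted post could look roughly like the sketch below. It is a plain-Python Benjamini-Hochberg adjustment, written here only for illustration; the function name `benjamini_hochberg` and the example p-values are made up, and in practice an existing library implementation would likely be the better choice.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Illustrative sketch only: `pvalues` is assumed to hold one p-value
    per feature, e.g. from a loop of per-feature regressions.
    """
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values
    # stay monotone (a smaller p never gets a larger adjusted value).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four features.
example_pvalues = [0.01, 0.04, 0.03, 0.005]
example_adjusted = benjamini_hochberg(example_pvalues)
```

The adjusted values can then be compared against whatever FDR threshold you would report (0.05 is a common choice, not a requirement).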

You proposed the rational solution of CLR transformation. How would you approach this if, for example, each participant has two samples (time 1 and time 2) and you would like to model the change for each feature as the independent variable in the regression? One solution that might seem possible is a simple subtraction of CLR values (CLR time 2 - CLR time 1) for each feature and each participant. Another possible solution might be Ln(relative abundance time 2 / relative abundance time 1), but this obviously does not account for compositionality. I'm interested in how you would approach this.
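
To make the two candidate approaches concrete, here is a minimal sketch for a single participant, using only the Python standard library. The counts, the function names, and the pseudocount of 0.5 (used to avoid taking the log of zero) are all assumptions made for illustration, not recommendations.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's feature counts.

    The pseudocount is an assumption, added so zero counts can be logged.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [v - mean_log for v in logs]


def clr_delta(counts_t1, counts_t2, pseudocount=0.5):
    """Per-feature change: CLR at time 2 minus CLR at time 1."""
    clr_t1 = clr(counts_t1, pseudocount)
    clr_t2 = clr(counts_t2, pseudocount)
    return [b - a for a, b in zip(clr_t1, clr_t2)]


def log_relative_abundance_ratio(counts_t1, counts_t2, pseudocount=0.5):
    """Per-feature ln(relative abundance time 2 / relative abundance time 1)."""
    total_t1 = sum(c + pseudocount for c in counts_t1)
    total_t2 = sum(c + pseudocount for c in counts_t2)
    return [
        math.log((b + pseudocount) / total_t2) - math.log((a + pseudocount) / total_t1)
        for a, b in zip(counts_t1, counts_t2)
    ]


# Hypothetical counts for three features in one participant.
time1 = [10, 20, 70]
time2 = [30, 30, 40]

delta = clr_delta(time1, time2)
log_ratio = log_relative_abundance_ratio(time1, time2)
# Feature-by-feature gap between the two approaches.
offsets = [lr - d for lr, d in zip(log_ratio, delta)]
```

Within one participant, the two quantities differ by the same offset for every feature (the change in the log geometric mean relative to the change in the log total), and the CLR differences sum to zero across features. That offset varies from participant to participant, so the two versions of the variable are not interchangeable in a regression across participants.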

jwdebelius:

Sample Classifiers

Okay, so given our as exposure, as outcome relationship, you can use a classifier to build a model that will use the microbes (and whatever else you want) to predict the outcome. (There's a tutorial for the qiime2 plugin, if it's of interest.)

However, like with most things, it's more complicated than just "yes". A sample classifier usually has a step where it selects features. Some people use differential abundance ( ~ y) to pick the features that then get fed into the classifier. Those get refined through what sits somewhere between computation and Arthur C. Clarke-esque magic. Your classifier will produce feature ranks, but it won't give you a nice table of coefficients with betas, RRs, or ORs and/or p-values that you might use to make your clinical collaborators happy.

So, while this is definitely an option, it may not be what you're looking for.

Longitudinal analysis also poses difficulties when using classifiers. Even the (liberal?) methods I proposed in the previous section may not work here, as suggested in this post: Could q2-sample-classifier take relative abundance or CLR transformed abundance as input? - #2 by Nicholas_Bokulich (@Nicholas_Bokulich). Is there a way to account for longitudinal changes per feature while using classifiers?

Thanks and happy to continue this discussion.