# modeling bacterial features as indepedent variables

Hello community!

One of the commonly used methods in this field is to test the association of each feature separately, using packages such as Maaslin2 and ANCOM-BC (which was recently added thanks to @lizgehret!). These packages enable the modeling of independent variables and testing each feature as the dependent variable. For example, if there are variables of weight, sex, and bacterial features, using Maaslin2/ANCOM-BC (as well as other packages and approaches), one can test the dependence of bacterial features on weight and sex (or their interaction, etc.).

However, an alternative approach could be to consider the features as an independent variable. For e.g., the formula could look like: feature + sex ~ weight, testing each feature.
Are there any packages or ideas for performing this type of analysis? (Using a simple example here, of course.)

2 Likes

Hi @kam,

Can I say it's complicated, it depends . (Also I've been trying to write this for 3 days, but now I have a new cup of tea and data computing for three projects, so its a good time to try and put my thoughts into words. Possibly.) I'll also preface this by saying that I work with a mix of epidemiologists and clinical researchers although I'm not either, and so while I'm going to borrow language, it's not going to be a perfect reading of any discipline.

Some Terms to make life easier

For the sake of all our conversations, I want to start by defining some terms.

From a conceptual perspective, let's say that we're interested int he relationship between microbes, weight, and age.

Given our conceptual model, let's call our "exposure", our "outcome" and a "confounder". We can describe these relationships with a series of equations:

(We could have drawn this as a DAG, but Im not that good at emojis.) We essentially want to understand ( , accounting for ). Whether this is the correct set of relationships and the directionality of those arrows is something we could spend hours arguing about, as well as how we test that experimentally and mathematically. It's a whole field of study on its own. So, let's just accept these for now.

Mathematically, we typically represent this relationships with a series of equations where we have a set of independent variables ({x_{0}, x_{1}... x_{i}) and a dependent variable (y), where we have some relationship,

y \approx \sum{\beta_{i}x_{i}}

for some appropriate linkage and distributional assumptions.

I'm going to assume that in your case, your conceptual model is one where , and you'd like your mathematical model to reflect that because it will probably be easier to interpret. (I'm leaving aside a tangent here, but let's accept that premise). It would likely make your results easier to interpret/sell if our dependent variable is associated with our outcome (y ~ ) and our exposure is our independent variable (x_{0} ~ ), and your data has a single fixed outcome, rather than a time to failure scenario. So, like, it could be a case/control or nested case/control study, a longitudinal study, whatever, as long as we're not worried about the time to out that outcome.*

Data Caveats

I'm also just going to remind myself that the microbiome data we're analyzing is high dimensional, sparse, and compositional. We can deal with some of the dimensionality and sparsity by filtering, and do transforms to deal with the compositionality. But, we're ultimately left with hundreds of features, the majority of which are present in 10-20% of our samples. We also broadly work under the assumption that ther are a subset of key features that are driving differences, and a bulk of features that aren't changing. That was an explicit assumption in ANCOM I and Songbird, but pretty much shows up everywhere. There's also sort of an implicity assumption baked into a ( ~ y_{0}...y_{i}) model, which is that there isn't interaction between the microbes we care about. So, it doesnt matter as much if two microbes are correlated with each other, only their relationship with the independent variables in the model.

So, given the model and the outcome, there are 3 modeling approaches I can see answering what you're looking for.

Sample Classifiers

Okay, so given our as exposure, as outcome relationship, you can use a classifier to build a model that will use the microbes (and whatever else you want) to predict the outcome. (There's a tutorial for the qiime2 plugin, if its of interest.

"However, like with most things, it's more complicated than just "yes" A sample classifier usually has a step where it selects features. Some people use differential abundance ( ~ y) to pick those features that then get fed into the classifier. Those get refined through what sits somewhere between computation and Arthur C Clarke-esque magic ). You classifier will produce feature ranks, but it won't give you a nice table of coeffecients with beta, RRs or ORs and/or p-values that you might use to make your clinical collaborators happy.

So, while this is definitely an option, it may not be what you're looking for.

Predicting outcome/single microbe exposure relationships

So, if sample classifiers wont give you what you need, why not just flip to a classic regression? You could theoretically just do the CLR transform yourself, write a for-loop, and then crank through all those models and come out with an OR, RR, or beta on the other side. (y ~ , x_{0} ~ ). There are lots of libraries that will do an FDR correction. And, on a simple level, you've sloved the problem. Yay!

Except that, as long as you don't have a time to failure component in your model*, the interpretation of (y ~ , x_{0} ~ ) and (x_{0} ~ , y ~ ) shouldn't be any different if you have a single microbe you're looking at. The coeffecients won't be different, but I'm not sure it matters if cases have 2x compared to controls, or that every time you double , your odds of being a case increase. (Im not caffienated enough to do that particular math). For continuous outcomes, it's pretty much just algebra. (if y=mx+b, then x=\frac{1}{m}(y-b)).

Based on that, my recommendation is to stick with your single direction standard tools that already know your data, and just think about how you frame things in your results/interpretation/discussion.

The one exception might be if you have a list of organisms a priori that you think are keystone or you want to reproduce. You need to think carefully about how you pick those, and how things line up, but cramming them through a x_{0} ~ model might make sense and make your life better. Just, like, FDR correct your data and be clear about why you picked your list.

Predicting outcome/poly-microbial exposure relationships

A second reason you might want to flip the direction is because you assume there's the potential for some kind of interesting interaction between the microbes. They're a community, they co-exist and chill toegther and eat eachother's garbage and fight battles, and have weird genetic swap meets. Maybe you dont just need 1 microbe to predict your outcome, maybe its a community.

From a classic regression standpoint, now you have two problems: the potential for co-linearity and combinatorial space. My 1 microbe/1 model thing is suddenly out the window as soon as I start adding the microbes together into multiple models. Even if I do like a ranked-choice thing with AICs and only add models were my fit is improved, the whole thing rapidly becomes a complicated model selection problem. You'll burn through false discovery space left, right, and center, and the whole thing will suck if you're using a p-value threshhold. A second issue is how your modeling gets affected if A and B are correlated (or anti-correlated) and not just for compositional reasons. Again, in the big weird world of microbes being a communitiy, there are all sorts of reasons why the two might show up together and that could make your modeling hard.

My recommendation in this case would be to use a polymicrobial technique. I happen to like Songbird or modified versions of Songbird to construct ALRs. There's a new machine learning package in R that will pick the best ALR features for you (although I dont remember the name) or things like phylofactor if you want to bring in phylogeny. However, in all fo thes cases, the equation will still be x ~ and y ~ {A, B, ...} because that's how the compositional modeling space works.

The nice thing is that most of these can be turned into a single univariate statistic which actually makes for really nice forward modeling in the future.

Conclusions

Okay, so having spent way too much time thinking about this over the past couple of days, what are the take aways?

1. For the most part* the mathematical expression of the relationship between outcome and exposure and the actual relationship between outcome and exposure are pretty malleable. At the end of the day, you're interested in whether or not there's a relationshp between two things, and you want a model that tests for that relationship. How the relationship fits into the broader causal paradigm depends more on the way you set up your experiment/interpret yoru results than on the actual model you use.
2. Its worth thinking through what assumptions you're putting into your models and tools around community.
3. If you have to, there are often in-elegent brute force solution , but you might miss something important.
4. Microbiome data is still complicated and fun.

Best,
Justine

Footnotes

*Survival/time to failure models are slightly different than other standard regressions because they're you're modeling two distributions. There's a distribution associated with time and a second distribution associated with the outcome. Because it's a two component system, you can't just flip the direction of the equation to get the interpretation.

8 Likes

As an epidemiologist myself, your language is spot on

Kudos

1 Like

A follow-up question regarding your suggested methods: As microbiome data is complicated enough, longitudinal analysis make metters worth

You proposed the rational solution of CLR transformation. How would you approach this if, for example, each participant has two samples (time 1 and time 2) and you would like to model the change for each feature as the independent variable in the regression? One solution that might seem possible is a simple subtraction of CLR values (CLR time 2 - CLR time 1) for each feature and each participant. Another possible solution might be Ln(relative abundance time 2 / relative abundance time 1), but this obviously does not account for compositionality. I'm interested in how you would approach this.

Longitudinal analysis also poses difficulties when using classifieirs. Even the (liberal?) methods I have proposed in the previous section may not work here, as suggested in this post Could q2-sample-classifier take relative abundance or CLR transformed abundance as input? - #2 by Nicholas_Bokulich @Nicholas_Bokulich . Is there a way to account for longitudinal changes per feature while using classifiers?

Thanks and happy to continue this discussion.

Hi @kam,

This is more of a question about what kind of model do you use for repeated measurements? I tend to use a linear mixed effects regression for this. You could theoretically also use a GEE; you should consult your local statistician for the proper way to do this modeling. Personally, it took me a little bit to work through the whole repeated measures thing (there was a like two hour phone call with a friend who's a statistician where he was extremely patient with me). I'm like 85% sure that ANCOM-BC either has or will soon have an LME component, so I'd probably recommend you just go for that.

I'm going to link you to another post that come up in the past day or two on longitudinal analysis. Full disclosure, it has some paired sample analysis I did for the project which includes a polymicrobial microbiome measurement I used as a a predictor in an odds ratio model. (My epidemologist was so happy with me!)

Best,
Justine

1 Like