Associate changes in taxa relative abundances (for all taxa) with calculated changes of continuous metadata variables

bioinfx · August 22, 2024, 4:38pm

Hi all!

I have a bioinformatics background, but I'm fairly new to microbiome analysis. I'm working with a longitudinal 16S dataset, and I've run into an analysis hurdle.

I've read a lot of forum posts (on this forum and elsewhere), searched the literature (e.g., here, here, and here), and talked to 3 people with a lot of microbiome analysis experience--and I'm still somewhat stumped regarding how to move forward.

Dataset: I have a longitudinal dataset with 3 timepoints (visit 1, visit 2, visit 3). 16S rRNA sequencing was performed at each timepoint for each subject. I have a variety of continuous and categorical clinical variables that we also measured at each visit.

Goal: I want to associate changes in specific taxa over visits (independent variable) to concurrent changes in other variables (dependent variables). E.g., to relate changes in taxa abundance from visit 1 to visit 2 (independent variable) to the change from visit 1 to visit 2 in resting heart rate (dependent variable). I want to do this for all my taxa (hundreds of individual statistical tests), similar to other exploratory approaches like differential relative abundance analysis.

Past approach with non-16S variables: On a much smaller scale, with variables other than the 16S data, I have been using linear regression models. I'm particularly interested in changes in variables between visit 1 and visit 2 (change between visit 1 and visit 2 will be indicated here as V1V2).

As an example for this post, let's suppose that I'm interested in V1V2 resting heart rate (dependent variable) and V1V2 sleep quality (independent variable). Using R's notation, I could set up a model like this to ask whether V1V2 sleep quality relates to V1V2 resting heart rate:

lm(heart_rate ~ v1_heart_rate + v1_sleep_quality + V1V2_change_sleep_quality, data=outcomes_V2)

where the data frame that I pass in (outcomes_V2) has been filtered to only have rows for visit 2, and each row has a column representing the visit 1 heart rate (v1_heart_rate) and the visit 1 sleep quality (v1_sleep_quality), along with a column for V1V2_change_sleep_quality, which was directly calculated in my script.

A simplified version of outcomes_V2 with only 2 participants might look like this:

participant_id     visit     heart_rate      v1_heart_rate     v1_sleep_quality     V1V2_change_sleep_quality
1                     2         100                 110                    5                                2
2                     2          70                  90                    8                               -1

This set up allows me to ask whether a V1V2 change in an independent variable is associated with a V1V2 change in a dependent variable.

Current problem: I know that longitudinal microbiome data presents some unique statistical challenges, and benchmark papers (e.g., here) suggest that statistical approach particularly matters with this kind of data to avoid an inflated FDR. I have been unable to find a tool (designed for 16S data) that allows me to control for visit 1 relative abundance of taxa nor one that allows me to use calculated changes in relative abundance between timepoints (i.e., V1V2_taxa_change) as independent variables (testing all taxa individually). Ideally, I'd like to be able to do both of these things in the same tool.

ANCOM-BC allows users to associate specific taxa with continuous metadata variables, as shown in the documentation:

I guess that the short version of what I'm looking for is the ability to identify taxa like this, but instead of associating taxa relative abundances at a time point with a continuous variable at that same timepoint, I want to associate changes in taxa with changes in a continuous variable.

Specific questions:

1) Are there microbiome-specific tools in QIIME 2 or elsewhere that allow users to control for baseline (i.e., visit 1) relative abundance of taxa with longitudinal data? Or that let users work with changes in relative abundance as an input variable? I imagine that a tool doing both of these things would essentially run tests like this for each taxa: v2_outcome ~ v1_outcome + v1_taxa + V1V2_taxa_change.

I've looked at the longitudinal plugin here, and it doesn't seem to quite do what I'm interested in--but I would love to be corrected if I'm mistaken.

2) If there are not existing tools that can do this, are there any statistical approaches that will allow me to get similar answers, even if it is not modeled exactly the same way?

One idea I had for this second question was to use existing tools to find V2 taxa that relate to V2 continuous variables (e.g., using ANCOM-BC) and then use existing tools to see if any important taxa from that analysis also happen to change in abundance from V1 to V2 across all our participants. This isn't a perfect approach, though.

Thank you for your time and help!

Jeremy

MARTIN_MIHULA · August 28, 2024, 8:36am

Hello,

You can check a documentation of these two tools: MetagenomeSeq and MaAsLin2. Both of them have an option how to deal with longitudinal data. I think they are more appropriate for what you are trying to do than ANCOM-Bc.

Good luck!

jwdebelius · August 28, 2024, 1:08pm

Hi @MARTIN_MIHULA and @bioinfx,

So, a few thoughts, for what they're worth.

First, this sounds like an interesting and complex experiment!

Your goal doesn't align with the currently avalaible tools. I think we can accomplish the same thing, but you have to adjust your theoretical framework. I wrote about this a while ago when someone was struggling with relationships in their DAG.

Because measures are contemporaneous, you have no way of knowing which came first: the change int he microbiome and the change in the overall physiology. You have a guess, but you don't know for sure. So, switch you model and remember that in a linear regression if y = mx + b, x = (y - b) / m.

This is especially true at this stage, where your p-value should be reasonably close regardless of the modeled relationship and you're essentially using the p-value as a cut-off for looking at your data.

This is basically a modeling problem. Its not an easy modeling problem, but its a modeling problem. You essentially want to model this as y ~ m*t which expands to y ~ m + t + m:t
where

y is your dependent variable,
m is the change in y for each unit change in m when t = 0 - or the effect of your independent variable at baseline if t is time.
t is the change in y for each unit change in t when m = 0 - or the effect of time alone on your depepent variable
m:t is the amount y changes for each unit change in m and t - how much does m change for over each time step.

The way you put the variables int he model - whether you treat t as continous or categorical, etc can determine how these predictors get put together. But, ultiamately, what you want is an interaction model. I'd find a statistican to work with on this one, if you have one avaliable, and youre still fuzzy.

This is exactly what the longitudinal plugin does for alpha and beta diveristy - again, flipping the direction because the direction of the equation matters less than you want it to.

Tool wise, I find @MARTIN_MIHULA's proposal interesting:

ANCOM-BC 2, in R has the linear mixed effect framework with inmpeleted. Its just not avaliable in QIIME 2. Im not familiar with these, although Im not sure if thye use appropriate underlying models. I know the p-values in MaAsLin are a bear to work with. I would double check Table 1 from Microbiome differential abundance methods produce different results across 38 datasets which unfortunately does not include ANCOM-BC but will give you details about hte others.

Best,
Justine

bioinfx · August 29, 2024, 3:24pm

Hi @MARTIN_MIHULA,

Thanks for taking the time to weigh in on this! I'll have to dig into MetagenomeSeq and MaAsLin2. Thanks for the suggestions!

Jeremy

bioinfx · August 29, 2024, 3:24pm

Hi @jwdebelius,

Thank you SO MUCH for your detailed response. Incredibly kind of you.

I've read the DAG-related post you linked here--this is helpful for understanding how to think about where features belong in models (i.e., should the taxa be our outcomes or our predictors? And does this even matter a whole lot?).

I've worked with interaction models before, and I have a statistician I can bug if needed as well. Thanks for the lead there.

For my benefit and the benefit of others who see this later (if I'm understanding correctly), it sounds like the tldr of all this is that I should

Flip the way I'm thinking about modeling this and use existing tools that have the predicted outcome be the features (taxa), rather than using features (taxa) as predictors.
For a specific modeling approach, I should use an interaction model where I interact time with my covariate. I'm aware of multiple existing tools that can do that, so I should choose one that performs well based on benchmark papers, seems to be appropriate for microbiome data (e.g., acknowledges compositionally, etc.), and run with that.

Is this an accurate summary?

Thank you again!

jwdebelius · August 29, 2024, 3:47pm

Hi @bioinfx,

Thanks for the very clear summary! That is exactly what I was trying to say.

Best,
Justine