Hi all!
I have a bioinformatics background, but I'm fairly new to microbiome analysis. I'm working with a longitudinal 16S dataset, and I've run into an analysis hurdle.
I've read a lot of forum posts (on this forum and elsewhere), searched the literature (e.g., here, here, and here), and talked to 3 people with a lot of microbiome analysis experience--and I'm still somewhat stumped regarding how to move forward.
Dataset: I have a longitudinal dataset with 3 timepoints (visit 1, visit 2, visit 3). 16S rRNA sequencing was performed at each timepoint for each subject. I have a variety of continuous and categorical clinical variables that we also measured at each visit.
Goal: I want to associate changes in specific taxa over visits (independent variable) to concurrent changes in other variables (dependent variables). E.g., to relate changes in taxa abundance from visit 1 to visit 2 (independent variable) to the change from visit 1 to visit 2 in resting heart rate (dependent variable). I want to do this for all my taxa (hundreds of individual statistical tests), similar to other exploratory approaches like differential relative abundance analysis.
Past approach with non-16S variables: On a much smaller scale, with variables other than the 16S data, I have been using linear regression models. I'm particularly interested in changes in variables between visit 1 and visit 2 (change between visit 1 and visit 2 will be indicated here as V1V2).
As an example for this post, let's suppose that I'm interested in V1V2 resting heart rate (dependent variable) and V1V2 sleep quality (independent variable). Using R's notation, I could set up a model like this to ask whether V1V2 sleep quality relates to V1V2 resting heart rate:
lm(heart_rate ~ v1_heart_rate + v1_sleep_quality + V1V2_change_sleep_quality, data=outcomes_V2)
where the data frame that I pass in (outcomes_V2
) has been filtered to only have rows for visit 2, and each row has a column representing the visit 1 heart rate (v1_heart_rate
) and the visit 1 sleep quality (v1_sleep_quality
), along with a column for V1V2_change_sleep_quality
, which was directly calculated in my script.
A simplified version of outcomes_V2
with only 2 participants might look like this:
participant_id visit heart_rate v1_heart_rate v1_sleep_quality V1V2_change_sleep_quality
1 2 100 110 5 2
2 2 70 90 8 -1
This set up allows me to ask whether a V1V2 change in an independent variable is associated with a V1V2 change in a dependent variable.
Current problem: I know that longitudinal microbiome data presents some unique statistical challenges, and benchmark papers (e.g., here) suggest that statistical approach particularly matters with this kind of data to avoid an inflated FDR. I have been unable to find a tool (designed for 16S data) that allows me to control for visit 1 relative abundance of taxa nor one that allows me to use calculated changes in relative abundance between timepoints (i.e., V1V2_taxa_change
) as independent variables (testing all taxa individually). Ideally, I'd like to be able to do both of these things in the same tool.
ANCOM-BC allows users to associate specific taxa with continuous metadata variables, as shown in the documentation:
I guess that the short version of what I'm looking for is the ability to identify taxa like this, but instead of associating taxa relative abundances at a time point with a continuous variable at that same timepoint, I want to associate changes in taxa with changes in a continuous variable.
Specific questions:
1) Are there microbiome-specific tools in QIIME 2 or elsewhere that allow users to control for baseline (i.e., visit 1) relative abundance of taxa with longitudinal data? Or that let users work with changes in relative abundance as an input variable? I imagine that a tool doing both of these things would essentially run tests like this for each taxa: v2_outcome ~ v1_outcome + v1_taxa + V1V2_taxa_change
.
I've looked at the longitudinal plugin here, and it doesn't seem to quite do what I'm interested in--but I would love to be corrected if I'm mistaken.
2) If there are not existing tools that can do this, are there any statistical approaches that will allow me to get similar answers, even if it is not modeled exactly the same way?
One idea I had for this second question was to use existing tools to find V2 taxa that relate to V2 continuous variables (e.g., using ANCOM-BC) and then use existing tools to see if any important taxa from that analysis also happen to change in abundance from V1 to V2 across all our participants. This isn't a perfect approach, though.
Thank you for your time and help!
Jeremy