I have samples from 3 experimental groups (KO, KO-KL and WT) along 4 intestinal regions (Duodenum, Jejunum, Ileum and Faecal). I am trying to understand which balances differ between the 3 experimental groups at each intestinal region.
I have run LME regression using host_ID as random effects, and the formula as “Intestinal_region*Experiment_group”. When I download the FDR corrected p-values, I have results for the following interactions:
I understand from a previous post (Gneiss regression summary issue) that group 1 in each category becomes the baseline to which every other group is compared, and is therefore not listed in the output (so groups KO and Duodenum are not shown in my output). What I don’t quite understand is how this works when interacting covariates are compared, and what my output is showing.
For example, if I look at results for: Intestinal_region[T.Faecal_sample]:Experiment_group[T.KO-KL] and Intestinal_region[T.Faecal_sample]:Experiment_group[T.WT] results, are these showing the differences between the 3 experiment groups in faecal samples only, or is there also some comparison with baseline group Duodenum too (as this is not shown)?
Would a more appropriate approach be to split my dataset by intestinal region, and run LME regression on each individual dataset using “Experiment_group” as a single covariate?
Interactions are tricky, but your instinct is correct. Duodenum is the baseline for the first factor and KO is the baseline for the second. You can think of these interaction terms as an intersection, i.e. Intestinal_region[T.Jejunum]:Experiment_group[T.WT] is looking at samples that are both in the Jejunum and WT, relative to those that are both in Duodenum and KO. Strictly speaking, the coefficient is the extra shift in that cell beyond what the separate Jejunum and WT terms already account for, so these terms always involve the Duodenum/KO baseline rather than being comparisons within a single region. More information about how formulas are constructed can be found here: http://patsy.readthedocs.io/en/latest/formulas.html
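To make the baseline behaviour concrete, here is a minimal sketch that builds the design matrix for this formula with patsy directly. The toy data (one made-up sample per region/group combination) is an assumption for illustration only; the column and level names are taken from the question.

```python
import numpy as np
from patsy import dmatrix

regions = ["Duodenum", "Jejunum", "Ileum", "Faecal_sample"]
groups = ["KO", "KO-KL", "WT"]

# Hypothetical toy data: one sample per region/group combination.
data = {
    "Intestinal_region": [r for r in regions for _ in groups],
    "Experiment_group": [g for _ in regions for g in groups],
}

design = dmatrix("Intestinal_region * Experiment_group", data)
columns = design.design_info.column_names

# Levels are sorted alphabetically, so Duodenum and KO become the
# baselines and never appear in a column name.
for name in columns:
    print(name)

# Which columns are switched on for the Jejunum + WT sample?
row = data["Intestinal_region"].index("Jejunum") + groups.index("WT")
values = np.asarray(design)[row]
active = [c for c, v in zip(columns, values) if v == 1]
print(active)
```

The Jejunum + WT sample switches on the intercept, both main-effect columns and the interaction column, so the modelled mean for that cell is the sum of those four coefficients. The interaction coefficient on its own is only the part left over after the two main effects.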
I think splitting up the dataset by intestinal region may be the way to go with this data, not because of the formula, but because of the potential zeros problem. There may be very few taxa shared between these regions.
Thanks for the explanation, it is certainly more clear now. So really, using the formula as “Intestinal_region*Experiment_group” is not going to tell me which balances differ at each intestinal site between experiment groups.
I did find the patsy link before I messaged, however I think it goes a bit over my head. Could you advise if there is a formula which will allow me to account (or correct) for changes due to intestinal regions, whilst allowing me to identify changes between experiment groups? (i.e. could I add “Intestinal_region” as a random effect, or is this not appropriate?).
I have run the analysis after splitting the dataset and removing any 1-sample ASVs. The only slight issue is the low n-number, which may prevent me from identifying many changes. I would therefore prefer to use the whole dataset, perhaps analysing the faecal samples separately as they are obviously the most dissimilar.
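For reference, the split-and-filter step described above could look roughly like the following pandas sketch. The table, sample IDs and counts are made up, and the threshold of two samples is an assumption standing in for "removing any 1-sample ASVs".

```python
import pandas as pd

# Hypothetical toy feature table: rows are samples, columns are ASV counts.
table = pd.DataFrame(
    {
        "ASV_1": [5, 2, 0, 0, 0, 0],
        "ASV_2": [0, 0, 1, 7, 3, 4],
        "ASV_3": [2, 4, 6, 9, 0, 1],
    },
    index=["s1", "s2", "s3", "s4", "s5", "s6"],
)
metadata = pd.DataFrame(
    {"Intestinal_region": ["Duodenum"] * 3 + ["Faecal_sample"] * 3},
    index=table.index,
)

per_region = {}
for region, meta in metadata.groupby("Intestinal_region"):
    subset = table.loc[meta.index]
    # Count the samples in which each ASV is present within this region.
    n_present = (subset > 0).sum(axis=0)
    # Keep only ASVs seen in at least two samples of the region.
    per_region[region] = subset.loc[:, n_present >= 2]

for region, subset in per_region.items():
    print(region, list(subset.columns))
```

Each filtered subset can then be analysed separately with “Experiment_group” as the only covariate. Note that the set of retained ASVs differs between regions in this toy example, which is the zeros problem mentioned earlier in the thread.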
What this will do is test for nested effects, so this will evaluate differences between Experiment_group within each Intestinal_region. But you will still have the baseline Intestinal_region and Experiment_group as discussed above.
Random effects are typically allocated for repeated measures (i.e. samples over time and space) where each individual has a random intercept. So it probably makes sense to have Intestinal_region as a random effect in your specific model.
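As an illustration of a random intercept per individual, here is a minimal sketch that fits a mixed model with statsmodels directly, using the formula from the original question and host_ID as the grouping variable. The data is synthetic and the `balance` column is a made-up stand-in for a single balance; this is not meant to reproduce the exact gneiss output.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
regions = ["Duodenum", "Jejunum", "Ileum", "Faecal_sample"]
groups = ["KO", "KO-KL", "WT"]

# Synthetic data: 12 hypothetical hosts, each sampled at all 4 regions.
rows = []
for host in range(12):
    group = groups[host % 3]
    host_offset = rng.normal(scale=1.0)  # shared by all samples of a host
    for region in regions:
        rows.append({
            "host_ID": f"host_{host}",
            "Experiment_group": group,
            "Intestinal_region": region,
            "balance": host_offset + rng.normal(scale=0.5),
        })
df = pd.DataFrame(rows)
for col in ("Experiment_group", "Intestinal_region"):
    df[col] = df[col].astype("category")

# Fixed effects come from the formula; groups= gives each host_ID
# its own random intercept.
model = smf.mixedlm(
    "balance ~ Intestinal_region * Experiment_group",
    df,
    groups=df["host_ID"],
)
result = model.fit()
print(result.fe_params)
```

The fixed-effect names printed here follow the same pattern as the terms in the regression summary, with Duodenum and KO absorbed into the intercept.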