Linear Mixed Effects Interpretation Help

Hi, I’m going through the Parkinson’s tutorial and have reached the last part, in which a linear mixed effects model is applied. My understanding is that, in this context, it tests whether donor and genotype are related in how they affect the fecal community (beta diversity) over time, relative to the initial baseline time point of day 7.

qiime longitudinal linear-mixed-effects \
  --m-metadata-file ./metadata.tsv \
  --m-metadata-file ./from_first_unifrac.qza \
  --p-metric Distance \
  --p-state-column days_post_transplant \
  --p-individual-id-column mouse_id \
  --p-group-columns genotype,donor \
  --o-visualization ./from_first_unifrac_lme.qzv

I just need a little help interpreting the results (attached).

Are these regression scatterplots essentially showing the variation in the fecal community in susceptible vs. wild-type mice and in hc (healthy) vs. pd (Parkinson’s) mice? I read the solid lines as the group mean of diversity, the data points as the beta diversity of individual samples, and the shaded area as the average overall variation of each group at specific time points. Is that right?

I’m not sure how to interpret the projected residual plot either. Is this like an ANOVA plot, so is it looking at variation in the data between different groups? I’ve read that the data points should be roughly centered around 0: if they are lower, that indicates lower variation compared to the mean variation of the group; if they are higher, it is higher than the mean; and if the points are not centered, then it’s a poor model (I’m not really sure what “poor model” means either). If this is correct, why is it useful to know, and how would you integrate it into an analysis?

I’m also a little confused by the model results. My understanding is that genotype acts as the independent variable, and that donor + distance (days_post_transplant) are pitted against genotype to determine whether there is a relationship between these dependent variables and genotype in eliciting change in the beta diversity of feces. However, not all combinations are explored: e.g. genotype[wild type]:donor[pd_1] is looked at, but susceptible genotype with a wild-type donor is not, and so on. In other words, not all combinations of metadata categories are explored to determine whether there is a relationship between these variables.

So, one of the questions in the tutorial asks: is there a significant association between genotype and temporal change?
Looking at days_post_transplant[T.wild-type], the p-value is <0.05, indicating that there is. But only wild-type is considered and not susceptible, so how would I answer these types of questions?

Lastly, I’m assuming I just look at the p-value to determine significance, but are there any other values of importance, e.g. the z value and the 0.025 and 0.975 columns? Are these important, and what do they represent?

Any advice is much appreciated - thank you

Hi @Tohseef,

This is a big question! Let’s try to break it down. You may also find it useful to review linear regression and its interpretation if it’s been a while (a good, basic stats book might be helpful) or to talk to a statistician. I’ve also found this post on Towards Data Science and the parameters section of this post about Stata results helpful for interpretation. A lot of your questions come down to regression modeling basics, and there are a lot of really smart people who have invested huge amounts of time into building material that teaches these things.

I think the shaded area may be the standard deviation as opposed to the variation, but otherwise yes, your reading of the scatterplots is correct.

For the residual plot, I’m going to refer you to this article on residual plots in general. Once again, you have your mean as the solid line, the standard deviation as the shaded area, and the actual values as the points.
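If it helps to see what a residual actually is, here is a minimal sketch in plain Python. Big caveats: the numbers are made up purely for illustration (they are not from the tutorial data), and this fits an ordinary straight line rather than the mixed effects model that QIIME 2 fits. The idea of "observed minus predicted" is the same, though.

```python
# Toy numbers, made up for illustration only (NOT from the tutorial data).
days = [7, 14, 21, 28, 35, 42]
distance = [0.30, 0.34, 0.41, 0.43, 0.50, 0.52]

n = len(days)
mean_x = sum(days) / n
mean_y = sum(distance) / n

# Ordinary least squares fit of a straight line (not a mixed effects model).
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, distance))
sxx = sum((x - mean_x) ** 2 for x in days)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = observed value minus the value the fitted line predicts.
residuals = [y - (intercept + slope * x) for x, y in zip(days, distance)]
mean_residual = sum(residuals) / n

for x, r in zip(days, residuals):
    print(f"day {x}: residual {r:+.4f}")
print(f"mean residual: {mean_residual:.2e}")
```

Residuals scattered evenly around zero, with no trend or funnel shape, suggest the line is a reasonable description of the data. A clear pattern in the residuals suggests the model is missing something, which is roughly what people mean by a "poor model".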

On the combinations that seem to be missing: some of this has to do with the way the model is coded here. You could explore the documentation for this particular function to get better control over the modeling.
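To make the coding point a bit more concrete, here is a small sketch of reference (treatment) coding in plain Python. It is an illustration under assumptions, not the plugin's actual code: the level names are the ones used in this thread, the reference level is assumed to be the first one alphabetically, and `treatment_code` is a hypothetical helper I made up for this example.

```python
from itertools import product

# Level names as written in this thread; the first level of each
# variable is ASSUMED to be the reference level.
genotype_levels = ["susceptible", "wild type"]
donor_levels = ["hc-1", "pd-1"]


def treatment_code(genotype, donor):
    """Return the 0/1 dummy columns for one mouse under reference coding."""
    wt = int(genotype != genotype_levels[0])  # 1 only for the non-reference genotype
    pd = int(donor != donor_levels[0])        # 1 only for the non-reference donor
    return {
        "Intercept": 1,
        "genotype[T.wild type]": wt,
        "donor[T.pd-1]": pd,
        "genotype[T.wild type]:donor[T.pd-1]": wt * pd,
    }


# One row of dummy columns per genotype/donor combination.
design = {combo: treatment_code(*combo)
          for combo in product(genotype_levels, donor_levels)}

for combo, row in design.items():
    print(combo, row)
```

All four genotype/donor combinations get a distinct row, even though only the non-reference levels show up as named terms. The reference levels are not ignored; they are what every other term is measured against.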

For this model, and most categorical regression models, you’re working against a reference group. Your reference group here is a susceptible mouse from the healthy donor at some theoretical time 0. So, on average, there’s a within-group distance of 0.248 [95% CI 0.101, 0.396].

Then, the genotype[T.wild type] term asks how much the intercept of your line changes if you have a wild type mouse compared to a susceptible mouse, holding everything else constant. (Your distance increases by 0.265 [95% CI 0.047, 0.465], which is significant.)

Then, the interaction term, genotype[T.wild type]:donor[T.pd-1], tells us how that genotype effect changes when you’ve got the pd-1 donor, compared to when we held things constant. Here, we find a change of -0.425 [95% CI -0.723, -0.126] distance units compared to the genotype[T.wild type] term alone (which is basically genotype[T.wild type]:donor[T.hc-1] because of the way we hold things constant).

The same logic extends to the remaining terms in the table.
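As a rough worked example of how the terms add up, here is the arithmetic using only the three coefficients quoted above. The variable names are mine, and the donor[T.pd-1] main effect is not quoted in this thread, so I have left the full baseline for wild type mice with the pd-1 donor out rather than guess at it.

```python
# Coefficients quoted in this thread (distance units).
intercept = 0.248     # reference group: susceptible genotype, hc-1 donor
genotype_wt = 0.265   # genotype[T.wild type]
wt_by_pd = -0.425     # genotype[T.wild type]:donor[T.pd-1]

# Baseline for wild type mice with the reference (hc-1) donor.
wt_hc_baseline = intercept + genotype_wt

# Wild type vs. susceptible difference among mice with the pd-1 donor:
# the genotype term plus the interaction adjustment.
wt_effect_in_pd = genotype_wt + wt_by_pd

print(f"wild type, hc-1 baseline: {wt_hc_baseline:.3f}")
print(f"wild type effect within pd-1: {wt_effect_in_pd:.3f}")
```

So the wild type offset is positive with the healthy donor (a baseline of about 0.513) but flips to about -0.160 with the pd-1 donor, which is what the interaction term is telling you.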

Now that we’ve talked through the terms and you’ve got some resources, I’m going to bounce this one back to you.

In your table, you have

| Values | Definition |
|---|---|
| Coef | The slope (for continuous variables) or the shift in intercept relative to the reference group (for categorical variables) that describes the difference in the value. |
| Std.Err. | The error in that coefficient estimate. |
| z | The parametric test statistic (Coef divided by Std.Err.) that gets used to calculate your p-value. You may want to look at t-tests and F-tests for a better sense of this (although it’s a different distribution). |
| P>abs(z) | The frequentist p-value: the probability of getting a z-value at least as extreme as the one associated with your data if the true coefficient were zero. You compare it against your chosen critical level (e.g. 0.05). |
| [0.025 & 0.975] | The lower and upper limits of the 95% confidence interval, which is useful for describing the estimate and the error around it. Look for info on effect size representation for more about why this matters and why you should probably be presenting it. |
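Here is a hedged sketch of how those columns relate to each other, using only the Python standard library. It assumes a standard Wald-style calculation (normal distribution, coefficient plus or minus about 1.96 standard errors). The standard error is not quoted anywhere in this thread, so I back it out of the intercept's confidence interval; `wald_summary` is a hypothetical helper, not a QIIME 2 function.

```python
import math

Z_CRIT = 1.959964  # two-sided 95% critical value of the standard normal


def wald_summary(coef, std_err, z_crit=Z_CRIT):
    """Wald z statistic, two-sided p-value, and 95% CI for one coefficient."""
    z = coef / std_err
    # Two-sided tail probability under a standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    ci = (coef - z_crit * std_err, coef + z_crit * std_err)
    return z, p, ci


# Intercept from this thread: 0.248 [95% CI 0.101, 0.396].
coef, lower, upper = 0.248, 0.101, 0.396

# ASSUMPTION: the interval is symmetric, so its half-width is z_crit * std_err.
std_err = (upper - lower) / (2 * Z_CRIT)

z, p, ci = wald_summary(coef, std_err)
print(f"std_err = {std_err:.4f}, z = {z:.2f}, p = {p:.4f}")
print(f"95% CI = [{ci[0]:.4f}, {ci[1]:.4f}]")
```

The point is that z, the p-value, and the confidence interval are all different views of the same two numbers (Coef and Std.Err.). The interval has the advantage of showing the size of the effect and its uncertainty, not just whether it crosses a significance threshold.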

Best,
Justine


Ahh, thank you for taking the time to explain. It’s clearer to me now.
I’m new to QIIME and generally terrible at math, so I definitely need to read up on some statistics to get a better overall understanding of these tests.

Thank you!


Hi @Tohseef,

Happy to help! And trust me, I think it takes everyone some time to get up to speed on this stuff - there are lots of models, assumptions, limitations, terms and displays to try and figure out!

We’re happy to keep trying to answer your questions here, and there are plenty of good online resources that talk through these concepts and walk you through them, often without ever really talking about the underlying mathematical magic :sparkles:

Best,
Justine