Clustering and regression analysis using q2-gneiss

JenKelly · May 18, 2018, 2:36pm

Dear Qiime developers,

I am trying to analyse longitudinal data using the q2-gneiss plugin. I have two sets of individuals each consuming a different diet over 5 time points. I am trying to assess the change in microbial composition (using balances) due to diet over time. I have a few questions regarding the clustering and regression steps.

Clustering: I do not understand when it is more appropriate to use gneiss correlation-clustering vs. gneiss gradient-clustering. For my data, I have just used correlation clustering, however if my I am trying to find compositional differences between diets should I use gradient clustering with "diet" as my clustering category? If not, please could you explain when gradient clustering should be used.

Regression analysis: As I want to add "individual" as a random effect I want to use gneiss lme-regression rather than gneiss ols-regression. I am not clear what syntax to use to define the formula if I want to assess the compositional change associated with diet over time adding individual as a random effect. Also should I define "--p-groups" as "timepoint and diet"?

I am pretty new to lme modelling so any advice regarding the correct way to analyse this set of data would be much appreciated.

Thanks very much for your help!
Jen

mortonjt · May 18, 2018, 3:46pm

Clustering : correlation-clustering may make the most sense here. Its a little weird to try to cluster by diet (how do you cluster by discrete categories)?

Regression : I would try to set the formula and groups variables as follows

--p-formula diet

--p-groups timepoint

That should give you the random effects for each individual, while testing for diet changes.

JenKelly · May 23, 2018, 12:50pm

Hi @mortonjt

Thank you for your advice.

I have attached the output of running gneiss lme-regression using formula and groups as you suggested. LME_Milk_Replacer_ID_timepoint.qzv (2.4 MB)

I am having trouble understanding how to interpret this output? There are no R2 values or cross-validation outputs to estimate the model fit? Regarding the "projected prediction" scatterplot, all my samples are clustered into one spot, could you explain what this means? Also, as there are so many balances, the heatmap is not interpretable, is there any solution to this?

Is my next step to identify significant balances from "Coefficient pvalues" csv file and then run 'qiime gneiss balance-taxonomy' on each one to understand how the balances are changing between diets? If this is the case there are 82 significant balances, which seems a lot to check every single one (particularly if I want to look at multiple taxonomic levels). Is there another way I can narrow down potentially interesting balances?

Finally, is there any way to visualise the effect of diet on balances over time?

Thanks again,
Jen

JenKelly · May 23, 2018, 12:50pm

Hi @mortonjt

One last question I forgot to put in my reply. You said the following:

Regression : I would try to set the formula and groups variables as follows

--p-formula diet

--p-groups timepoint

That should give you the random effects for each individual, while testing for diet changes.

How does the model know which metadata category to add random effects for, as there is no option to supply a metadata category to represent individual id.

Thanks again,
Jen

mortonjt · May 23, 2018, 4:21pm

Hi @JenKelly, good catch. --p-groups should correspond to the host_subject_id, not the timepoint (sorry about the confusion). An example of how to use this can be found in the cfstudy example here

R2 and cross validation is super weird with respect to LME, and it is not obvious how to do it. There is a pretty cool article here that describes some of the challenges.

When there are that many balances, you could directly download the pvalues. I'd hazard against looking at every single balance. From that list of balances, I'd start by trying to sort through the balances based on p-value, coefficient size and size of balance. Note that balances with a large number of individuals will likely give a more interesting signal compared to balances with few individuals, since it would explain a larger proportion of the community.

Concerning visualizing balances over time, there isn't a mechanism to explicitly do this in qiime2. We do have an example on how to do this in python here (an equivalent thing could be done in R as well). Agreed, it would be great to have this available in qiime2 in the near future.

JenKelly · May 24, 2018, 8:17am

Hi @mortonjt

Thanks for the reply!

I'd spotted that cfstudy tutorial earlier, however it does not explain how to include time in model, would I add that to the formula? i.e as follows

--p-formula diet+timepoint

--p-groups host_subject_id

Although I do want to add host_subject_id as a random effect, I did run a linear regression on the data just as an initial investigation. This allowed me to evaluate the effect of each covariate by looking at R2diff, and I found that as expected, timepoint and host_subject_id explained the most variance in the balances. Is this still a valid approach to understand the contribution of each individual covariate, or does this not make sense as it's a longitudinal study with random effects?

Thanks again for your advice, it's extremely helpful!

Jen

mortonjt · May 24, 2018, 6:26pm

That should be correct -- in the cfstudy, the pH was a proxy for time (since there are multiple repeated measures with respect to pH).

Let me know if you have any more questions!

JenKelly · June 12, 2018, 9:29am

Hi @mortonjt,

Thank you for your previous replies.

Would you mind commenting on my previous question about whether using ols modelling is appropriate with longitudinal data.

Although I do want to add host_subject_id as a random effect, I did run a linear regression on the data just as an initial investigation. This allowed me to evaluate the effect of each covariate by looking at R2diff, and I found that as expected, timepoint and host_subject_id explained the most variance in the balances. Is this still a valid approach to understand the contribution of each individual covariate, or does this not make sense as it’s a longitudinal study with random effects?

Also, when I get the output of running gneiss lme-regression with the following

–p-formula diet+timepoint

–p-groups host_subject_id

I don't really understand how the continuous and categorical covariates put into "--p-formula" interact, and whether this formula will actually identify significant balances that are effected by diet over time, or just those effected by either diet1, diet2, or individual timepoints. For example when I look at the coefficient and p-value output csv files (from lme-regression) like you suggested, each covariate (diet2, timepoint2, timepoint3, timepoint4) ect.. has separate coefficient and p-values. Does that mean each covariate is being treated separately, rather than testing for any interactions?

One last thing! I came across a post talking about "Correct p-values" when looking at output of gneiss regression (Using ANCOM and gneiss with one categorical variable - #7 by mortonjt).

Careful - you’ll want to only consider those pvalues under Corrected Pvalues. And pvalues close to 0.05 will also warrant skepticism, so make sure to carefully sanity check the boxplots / scatter plots in the balance-taxonomy command.

I don't see a "Corrected Pvalues" column, therefore I just wanted to check that the pvalues I download from "Coeficcient pvalues" from the regression output are in fact already corrected?

Thanks again and sorry for my confusion!
Jen

mortonjt · June 12, 2018, 9:41pm

Hi @JenKelly,

Using straight up linear regression is ok with time series. The means should be the same, but adding in random effects can help with modeling variances and can identify less obvious signals.

Regarding the second question, since there are no interaction terms specified, the those effects will just be additive effects.

If you wanted interaction terms, you can do --p-formula diet*timepoint

Unfortunately the changes for corrected pvalues didn't make it in the last release. We just pushed out another minor release of gneiss, so that should be fixed in the next iteration. In the meantime, the corrected pvalues should be done manually.

JenKelly · June 13, 2018, 8:11am

Hi @mortonjt

Thanks for the response, it was very helpful, I will try out interactions formula and see what I get.

Can I check that uncorrected p-values are supplied for both gneiss ols-regression and gneiss lme-regression, and therefore have to be adjusted by multiplying by number of tests?

Thanks again,
Jen

mortonjt · June 13, 2018, 7:15pm

For Boniferroni yes, you can just multiply by the number of tests. The release is up, so a reinstall of q2-gneiss should just yield the updated webpage