Linear Regression Summary in Gneiss

Hi Qiime2 folks,

I am super new to the QIIME 2 microbiome bioinformatics platform; a friend introduced me to it a few months back. I do love QIIME 2 a lot, and thank you for developing the platform. Do forgive me if some of the questions are too easy for QIIME 2 experts. :wink:
I have a few quick questions related to gneiss:

a) For the current study, we have a total of 20 different soil parameters (chemical and physical analyses) and 50 observations. I would like to know whether it is possible to run correlation-clustering and ols-regression to find the soil parameters (covariates) with the highest R2 diff (as well as low corrected p-values: <0.05) before adopting gradient-clustering for those covariates (for instance, pH) that contribute to the variation in soil microbiomes.

b) In one of the trial runs using gneiss, all the soil parameters (both chemical and physical analyses) were used to compute the linear regression summary (through ols-regression after correlation-clustering), and an R-squared of more than 0.600 was achieved. Unfortunately, all the pred_mse values (from fold 0 to 9) were higher than model_mse. :frowning: I then ran it again with only the soil chemical parameters (16 covariates and 50 observations). The pred_mse values (ranging from 7.5 to 22) are now lower than model_mse (ranging from 22 to 26). I am still considering whether to further reduce the number of soil chemical parameters (covariates) and re-run ols-regression. Is there a minimum ratio (pred_mse to model_mse) that we should achieve before moving on to the balance-taxonomy analyses? Is there a maximum number of covariates allowed for ols-regression?

Thank you in advance.

Looking forward to learning more.



Hi @yitkheng, glad you are having some success.

A couple of comments

  1. It’ll help to paste an image of the diagnostic plots.
  2. My recommended differential abundance workflow has changed significantly since the development of gneiss. See our paper here. Also see songbird and qurro.

Hi @mortonjt, glad to hear from you. :slightly_smiling_face: My apologies for the delay in responding.

I have attached a copy of the regression summary (generated with the correlation-clustering tree) to this post for your reference.

Based on the summary, pH appears to have the highest R2diff. Would it be possible for me to re-run with gradient-clustering on pH before the downstream analyses (with balance-taxonomy)?

Thank you for the latest differential abundance workflow and your most recent work. I will go through the paper, songbird, and qurro as soon as I can.

Thank you.


My apologies. :pray::pray:

The regression summary posted earlier was generated with gradient-clustering using pH.

The regression summary below was generated using correlation-clustering.

Hi @yitkheng - the R^2 suggests that this may be a good fit.

I can’t comment on the ratio; it is typically used as a guide for cross-validation (you don’t want pred_mse to be much higher than model_mse).

I also want to point out that there have been a number of findings regarding differential abundance since gneiss was initially developed. I recommend reading this paper and checking out either songbird or aldex2, since they may provide a much easier way to obtain interpretable features.
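To make the pred_mse versus model_mse comparison concrete, here is a minimal sketch of 10-fold cross-validation for an ordinary least squares fit in plain NumPy. It is not gneiss's implementation and may normalize the errors differently; the data are synthetic, and the helper names (`ols_fit`, `mse`, `kfold_mse`) are made up for illustration. The sizes (50 observations, 20 covariates) mirror the study described above, but only one covariate actually drives the response.

```python
import numpy as np


def ols_fit(X, y):
    """Least-squares coefficients; X must already contain an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def mse(X, y, beta):
    """Mean squared error of the fitted coefficients on the given rows."""
    resid = y - X @ beta
    return float(np.mean(resid ** 2))


def kfold_mse(X, y, n_folds=10, seed=0):
    """Return one (model_mse, pred_mse) pair per fold.

    model_mse is the error on the samples used for fitting;
    pred_mse is the error on the held-out samples of that fold.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    rows = []
    for held_out in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, held_out)
        beta = ols_fit(X[train], y[train])
        rows.append((mse(X[train], y[train], beta),
                     mse(X[held_out], y[held_out], beta)))
    return rows


# Synthetic stand-in data (an assumption, not the soil data from this thread).
rng = np.random.default_rng(42)
n_samples, n_covariates = 50, 20
covariates = rng.normal(size=(n_samples, n_covariates))
X = np.column_stack([np.ones(n_samples), covariates])
# Only the first covariate affects the response; the other 19 are noise.
y = 2.0 * covariates[:, 0] + rng.normal(size=n_samples)

results = kfold_mse(X, y)
for fold, (model_mse, pred_mse) in enumerate(results):
    print(f"fold {fold}: model_mse={model_mse:.2f}  pred_mse={pred_mse:.2f}")
```

With many covariates relative to the number of samples, the held-out error in a sketch like this tends to sit well above the in-sample error, which is the overfitting pattern the cross-validation table is meant to flag. Dropping covariates that carry no signal usually narrows the gap.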


Hi @mortonjt,

Thank you for the suggested paper and the two other plugins. I will go through them as soon as I can.
