Gneiss balances: Zero values for model mse

Hi all,

I performed a gneiss regression analysis on 10 samples with qiime gneiss ols-regression. Weirdly, the model_mse comes out as zeros, whereas there are pred_mse values (see screenshot). The regression coefficients summary plot is empty. I would like to check for "over-fitting" and wanted to compare model_mse to the predicted values. @mortonjt any idea of what is going wrong? Could you also please explain what it means for my data set if over-fitting is happening?

Thanks a lot!!

-steffen

That is interesting. If the MSE is zero, I’d suspect that overfitting is occurring. An R^2 of 0.91 is also incredibly high for an OLS regression.

A couple of comments.

  1. What do the PCoA plots look like? The R^2 should be comparable to the total variation explained in the top 3 axes (since it is just another measure of explained variance). If your PCoA has very high % explained variance, then maybe these results are sane.
  2. What does the distribution of the data look like? When you run qiime feature-table summarize, are there many low-abundance features?

Also, as a rule of thumb, you want the number of covariates to be about 10% of the number of samples. If you have 10 samples and you are trying to fit 10 covariates, then you will definitely get perfect fits and overfit the data. If you only have 10 samples to begin with, you should probably only try to fit 1 covariate.
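To illustrate why a zero model MSE points to overfitting, here is a minimal sketch in plain NumPy. It is not gneiss's implementation: the helper `fit_and_loo_mse`, the random data, and the use of leave-one-out folds are all assumptions made for illustration. The response is pure noise, so there is nothing real to learn, yet with as many parameters as samples the in-sample error still collapses to (numerically) zero while the held-out error does not.

```python
import numpy as np

def fit_and_loo_mse(X, y):
    """Hypothetical helper: OLS with an intercept.

    Returns (model_mse, pred_mse), where model_mse is the in-sample
    error and pred_mse is the leave-one-out held-out error.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])          # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares fit on all samples
    model_mse = float(np.mean((y - A @ beta) ** 2))

    held_out_errors = []
    for i in range(n):
        keep = np.arange(n) != i                  # drop sample i, refit, predict it
        b, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        held_out_errors.append((y[i] - A[i] @ b) ** 2)
    return model_mse, float(np.mean(held_out_errors))

rng = np.random.default_rng(0)
n = 10
y = rng.normal(size=n)                 # pure-noise response
X_many = rng.normal(size=(n, n - 1))   # 9 covariates + intercept = 10 parameters
X_one = X_many[:, :1]                  # a single covariate

many_model, many_pred = fit_and_loo_mse(X_many, y)
one_model, one_pred = fit_and_loo_mse(X_one, y)

print(f"10 parameters: model_mse={many_model:.3g}, pred_mse={many_pred:.3g}")
print(f" 2 parameters: model_mse={one_model:.3g}, pred_mse={one_pred:.3g}")
```

With a single covariate the in-sample error stays clearly above zero, which is what an honest fit to noisy data should look like. A large gap between the in-sample and held-out errors is the usual sign of overfitting.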

Thanks for the comments @mortonjt.

Which PCoA plots do you mean? Bray-Curtis, Jaccard, weighted/unweighted UniFrac? I am also guessing you suggest checking the bar plot on the right side of the Emperor plots (under axes), is that right?

I can already tell you that I do have a high fraction of low-abundance features, based on what I have seen in my data. The sample that I used for gneiss was filtered and does not have feature abundances lower than 5.

This might be too basic a question, but: how do I choose the number of covariates?

Once you let me know which PCoA plot is best to use, I can plot it and post it here for you to inspect it.

Thanks so much!
steffen

Pick your favorite abundance-based distance. Bray-Curtis or weighted UniFrac should do.

Here, you only have 10 samples. So you probably can only run a regression with a single covariate (you shouldn’t fit more covariates than 10% of your samples). I think this is where the overfitting is coming from.

I see. Does that mean in order to use 1 covariate, the parameter --p-formula would only contain one metric? Such as pH…

@mortonjt I think I got it now.

I have a couple of follow up questions though:

The pred_mse value of the fold_9 row is 0.0000, why is that?

Unfortunately the regression coefficients summary plot is still empty:

Lastly, what type of plot is the one below this empty one, showing the predicted and raw data points on y0 vs. y1 axes? I understand what it is showing, but is there a common term for this plot?

Thanks a lot and sorry for the too simple questions!

cheers,
steffen

Interesting. This looks more like what I’d expect for linear regression (even though these are extremely small sample sizes).

Are you able to download the coefficients, residuals and pvalues from the main simplicial linear regression summary? I’m not entirely sure what is going on with the coefficient heatmap, but getting those values may shed some light on that.

Thanks @mortonjt I just sent you the data for checking.

Hi @steff1088, your pvalues look sane. Are you able to zoom in on the coefficient heatmaps?

We have noticed that the heatmap sometimes appears to be empty, but it looks more sane once you start navigating it.
