A couple of questions about Gneiss and model overfitting

Dear Qiimers,

I am taking my first steps with Gneiss and trying to get more familiar with the results. I could use some help interpreting the output:

  1. Looking at the model summary, mse vs pred_mse: as I understand it, we are comparing the mse of the model on the training dataset (a random 90% of the data) with the mse on the test dataset (the remaining 10% of the data). If the model is overfit, the prediction error will be larger than the model's mse. My question is: how good is good enough? I.e., should the pred_mse be on the order of 1/10th of the mse? What if the two values are about the same? Is there a ratio between the two that can be used as a rule of thumb to judge over/underfitting?

  2. Comparison between the two plots, “projected predictions” and “projected residuals”: in the first, I can eyeball whether the predicted values are a reasonable representation of the real data; in the second, by comparison with the first, I can check whether the residuals are of the same order of magnitude as the predictions (= not good; that would mean large random error and little predictive value for my model, as appears to be the case for the tutorial dataset, where R-squared = ~0.11). Is my interpretation correct?

Thank you for your kind attention,
Max

  1. pred_mse just needs to be equal to or smaller than mse. Otherwise there is a good chance of overfitting.
  2. Remember you are predicting the abundances of an entire microbiome community - so that R^2 means that you are able to explain 11% of all of the variance in that community.
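
To make the two points above concrete, here is a minimal sketch of the comparison, assuming a plain least-squares fit on a random 90/10 split. It is not Gneiss's own code: the function name `split_and_score`, the synthetic data, and the choice to average squared residuals over all samples and all response columns are assumptions made for illustration only.

```python
import numpy as np


def split_and_score(X, Y, test_fraction=0.1, seed=0):
    """Fit least squares on a random training split and score both splits.

    X: (n_samples, n_covariates) array, Y: (n_samples, n_responses) array.
    Hypothetical helper, not part of Gneiss or QIIME 2.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    order = rng.permutation(n)
    n_test = max(1, int(round(test_fraction * n)))
    test, train = order[:n_test], order[n_test:]

    # Add an intercept column to the design matrices.
    A_train = np.column_stack([np.ones(len(train)), X[train]])
    A_test = np.column_stack([np.ones(len(test)), X[test]])

    # One coefficient vector per response column, fitted on training data only.
    beta, *_ = np.linalg.lstsq(A_train, Y[train], rcond=None)

    resid_train = Y[train] - A_train @ beta
    resid_test = Y[test] - A_test @ beta

    # Error on the data the model saw vs. the data it did not see.
    mse = float(np.mean(resid_train ** 2))
    pred_mse = float(np.mean(resid_test ** 2))

    # R^2 pooled over every response column: the share of the total
    # training variance that the fitted values account for.
    ss_res = float(np.sum(resid_train ** 2))
    ss_tot = float(np.sum((Y[train] - Y[train].mean(axis=0)) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    return {"mse": mse, "pred_mse": pred_mse, "r2": r2}


# Synthetic example: 2 covariates, 3 response columns (all values made up).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
B = np.array([[1.0, -2.0, 0.5], [0.3, 0.8, -1.0]])
Y = X @ B + 0.1 * rng.normal(size=(200, 3))

scores = split_and_score(X, Y)
print(scores)
print("possible overfitting:", scores["pred_mse"] > scores["mse"])
```

With a single random split, both numbers change with the seed, so it may help to repeat the split a few times before reading much into one comparison.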

Thank you, @mortonjt, for your kind answer. Just to clarify no. 2: is my interpretation of the two scatter plots correct? Please let me know if my question is not clear and I need to add further details.

Bear with me, I’m a noob :roll_eyes:

Yes, your interpretation for question 1 is correct :slight_smile: