Corrected p-values using gneiss lme-regression smaller than uncorrected p-values

mortonjt · August 9, 2018, 4:53am

Yikes! Another post was brought up here: Gneiss Singular matrix error - #5 by ajaybabu27

This is almost surely pointing to some sort of ill-defined input. Does this throw an error on just one of the inputs? And have you tried running OLS on it as a sanity check? I know it is not "technically" correct, but if it also fails on OLS, then that is a sign of a more sinister problem.

Nice article! Right, an interpretable goodness-of-fit is not too straightforward. Another nice blog post breaks this problem down here. Gneiss does provide the residuals and fits, so it should be possible to rig your own goodness-of-fit. But this is definitely something that we really should have in Gneiss in the future.
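As a rough illustration of rigging your own goodness-of-fit from the residuals and fits, here is a minimal sketch that computes a plain R² for a single balance. The function name and the toy numbers are hypothetical, and it assumes the observed value is simply fit + residual. Treat it as a descriptive summary only, not a proper mixed-model R².

```python
def r_squared(fitted, residuals):
    """Plain R^2 = 1 - SS_res / SS_tot for one balance.

    Assumes observed = fitted + residual for every sample.
    """
    observed = [f + r for f, r in zip(fitted, residuals)]
    mean_obs = sum(observed) / len(observed)
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    # A balance with zero variance has no meaningful R^2.
    if ss_tot == 0:
        return float("nan")
    return 1 - ss_res / ss_tot


# Toy values for one balance (made up for illustration).
fitted = [1.0, 2.0, 3.0, 4.0]
residuals = [0.5, -0.5, 0.5, -0.5]
r2 = r_squared(fitted, residuals)
print(r2)
```

Running this per balance gives a quick way to see which balances the model describes poorly, though it ignores the grouping structure entirely.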

willowblade:

In an effort to find a simplified version of the model that would run, I included only two variables (Section and MilkFeeding2). Section did not seem to have the p-value issue, and had the advantage of having enough difference within subject to dodge the singularity issue, while MilkFeeding2 is a variable I would like to understand that is definitely causing problems. When I run this model, the p-value problem in MilkFeeding2 disappears, but the problem seems to remain in the grouping variable. I'm guessing this means that including all variables is overfitting the model, but I am a bit concerned as the number of balances with NaN for the p-value remains a bit high. All of which is bringing me back to questions on the best way to test the model... FTS_regression_8.qzv (1.8 MB)

Right, this is tricky and unfortunately there isn't really a right answer that I'm aware of at the moment. If I had to guess where the NaNs are coming from, it is likely that there aren't enough samples within the subgroups: you need at least 3 samples for each cross-section. That means for a given milk type and a given section for a given individual, you need at least 3 values for your particular balance to have non-zero variance. If you don't have many samples, this can actually be quite a stringent criterion.
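To check how stringent that is for your data, you could count samples per cross-section directly from the metadata. The sketch below is only an illustration: the subject IDs and category values are made up, and only the variable names Section and MilkFeeding2 come from this thread. It counts samples per cell and does not check the variance of any balance.

```python
from collections import Counter

# Hypothetical metadata, one tuple per sample:
# (subject, Section, MilkFeeding2)
samples = [
    ("calf1", "sec1", "milkA"),
    ("calf1", "sec1", "milkA"),
    ("calf1", "sec1", "milkA"),
    ("calf1", "sec2", "milkA"),
    ("calf2", "sec1", "milkB"),
    ("calf2", "sec1", "milkB"),
]


def sparse_cells(rows, min_samples=3):
    """Return the cross-sections that have fewer than min_samples samples."""
    counts = Counter(rows)
    return {cell: n for cell, n in counts.items() if n < min_samples}


thin = sparse_cells(samples)
for cell, n in sorted(thin.items()):
    print(cell, n)
```

Any cell that shows up here is a candidate source of NaNs under the 3-sample guess above.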

The rule of thumb I use when it comes to filtering species is to count your degrees of freedom. If you have 2 two-level categorical variables, you have 2 degrees of freedom. If you have a bunch of microbes that are only observed in 2 samples, you can get a near perfect fit for those microbes, so whatever inference you perform on them is not useful (because your model will not have the resolution to measure them). At a bare minimum, I would recommend counting all of the variables in your formula and using that as the baseline for filtering. Be careful, though: a categorical variable with D categories actually has D-1 degrees of freedom and therefore counts as D-1 variables.
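Here is a minimal sketch of that counting rule. The level counts, taxon names, and count table are hypothetical, and the cutoff of "observed in more samples than the degrees of freedom" is one reading of the baseline, so adjust it to taste.

```python
def formula_dof(levels_per_variable):
    """Total degrees of freedom for the variables in a formula.

    A categorical variable with D levels contributes D - 1;
    a continuous variable (marked with None) contributes 1.
    """
    return sum(1 if d is None else d - 1 for d in levels_per_variable.values())


# Hypothetical numbers of categories for the two variables in the model.
variables = {"Section": 4, "MilkFeeding2": 3}
dof = formula_dof(variables)  # (4 - 1) + (3 - 1)

# Hypothetical count table: one row of per-sample counts per taxon.
counts = {
    "taxonA": [5, 0, 3, 2, 0, 7, 1, 4],
    "taxonB": [0, 0, 9, 0, 0, 0, 2, 0],
}


def prevalence(row):
    """Number of samples in which the taxon is observed at all."""
    return sum(1 for c in row if c > 0)


# Keep only taxa observed in more samples than the model has degrees of freedom.
kept = [name for name, row in counts.items() if prevalence(row) > dof]
print(dof, kept)
```

Here taxonB is seen in only 2 samples, so it would be dropped before building balances.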