Gneiss OLS regression model fitting

Dorothy · February 14, 2018, 1:57pm

Hi,

I read your paper "Balance Trees Reveal Microbial Niche Differentiation" with great interest and would like to try it out with my own data.

I have 110 samples but 22 covariates (after filtering out highly correlated ones). How should I select the best subset of covariates to fit the model?

best regards,
Dorothy

mortonjt · February 25, 2018, 7:33pm

Hi @Dorothy, sorry about the late reply.

There isn't a standard protocol for filtering features, but we recommend filtering out features that don't contain much information.

By that I mean

Features that have few reads (e.g. fewer than 10 reads across all samples).
Features that are rarely observed (e.g. present in fewer than 5 samples in a study).
Features that have very low variance (e.g. less than 10e-4).

Of course, these numbers will fluctuate based on the number of samples present in the study.
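
As a rough illustration of the three filters above, here is a minimal Python sketch using NumPy. The function name `filter_features`, its parameter names, and the toy table are made up for illustration and are not part of gneiss. The thresholds are copied from the list above, and computing the variance on raw counts is an assumption, since the scale is not specified.

```python
import numpy as np

def filter_features(counts, min_reads=10, min_samples=5, min_variance=10e-4):
    """Return a boolean mask over feature columns that pass all three filters.

    counts: 2-D array with samples in rows and features in columns.
    All names and defaults here are illustrative assumptions.
    """
    counts = np.asarray(counts, dtype=float)
    total_reads = counts.sum(axis=0)        # reads per feature across all samples
    prevalence = (counts > 0).sum(axis=0)   # number of samples containing the feature
    variance = counts.var(axis=0)           # variance of raw counts (assumed scale)
    return (
        (total_reads >= min_reads)
        & (prevalence >= min_samples)
        & (variance >= min_variance)
    )

# Toy table: 6 samples (rows) x 4 features (columns), invented for this example.
toy = np.array([
    [5, 0, 20, 2],
    [3, 0,  0, 2],
    [4, 9,  0, 2],
    [6, 0,  0, 2],
    [2, 0,  0, 2],
    [7, 0, 30, 2],
])
mask = filter_features(toy)
filtered = toy[:, mask]  # keeps only the columns that pass every filter
```

In the toy table only the first feature survives: the second has too few reads, the third appears in too few samples, and the fourth is constant. With real data the thresholds would need adjusting to the number of samples.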

Dorothy · February 26, 2018, 9:51pm

Thank you for your reply. I think you misunderstood my question. What I wanted to ask is: when multiple environmental factors are measured (in my case 22 variables), how should I select the independent variables/environmental factors/covariates to fit the regression model?

In the beginning, I fit the model with all the covariates that I thought might influence the microbiome. After I saw that many of them do not explain much of the variance, I tried fitting with one variable first, then adding the others one by one and keeping those that explain at least 2% of the variance in the data.

Do you think I am doing it right?

mortonjt · February 27, 2018, 9:35pm

Ahh.... ok covariates, not features.

The covariates can be selected using the regression summaries when you run the full model. We use a leave-one-variable-out cross-validation approach, where we evaluate the change in R^2 when each variable is left out. The variables whose removal changes the R^2 the most are probably the ones you are most interested in.
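
To make the leave-one-variable-out idea concrete, here is a minimal NumPy sketch, not the gneiss implementation. It fits ordinary least squares on all covariates, refits with each covariate removed, and reports the drop in in-sample R^2 pooled over all response columns (standing in for balances). The helper names, the covariate names, and the simulated data are all assumptions for illustration.

```python
import numpy as np

def r_squared(X, Y):
    """In-sample R^2 of an OLS fit of Y on X (intercept added), pooled over response columns."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
    resid = Y - design @ beta
    ss_res = (resid ** 2).sum()
    ss_tot = ((Y - Y.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def leave_one_variable_out(X, Y, names):
    """Return the full-model R^2 and the drop in R^2 when each covariate is left out."""
    full = r_squared(X, Y)
    drops = {}
    for j, name in enumerate(names):
        reduced = np.delete(X, j, axis=1)   # all covariates except column j
        drops[name] = full - r_squared(reduced, Y)
    return full, drops

# Simulated data: 110 samples, 3 hypothetical covariates, 2 response columns.
rng = np.random.default_rng(0)
names = ["ph", "temperature", "depth"]
X = rng.normal(size=(110, 3))
Y = np.column_stack([
    2.0 * X[:, 0] + rng.normal(size=110),
    1.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=110),
])
full_r2, drops = leave_one_variable_out(X, Y, names)
ranked = sorted(drops, key=drops.get, reverse=True)  # largest drop in R^2 first
```

In this simulation the covariate labeled `ph` drives both responses, so leaving it out costs the most R^2 and it ranks first. Because the reduced models are nested inside the full model, the in-sample drops are never negative.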

That being said, the approach that you are taking is completely valid. It may be worthwhile to read up on relative importance analysis (more information can be found in the R package relaimpo). Basically, you are proposing the strategy that they refer to as "first", whereas we implemented the "last" strategy. Both are heuristics (this boils down to the feature selection problem, which is difficult), and there may be smarter ways to go about this.

tl;dr both strategies are valid. We have an implementation that helps select covariates, but there is definitely room for improvement.
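
The difference between the two strategies can be sketched in a few lines of NumPy. This is an illustration under assumptions, not relaimpo or gneiss code: "first" is taken here as the R^2 of each covariate fitted alone, and "last" as the gain in R^2 when that covariate is added to a model already containing all the others. The function names and the simulated covariates `a`, `b`, and `c` are hypothetical.

```python
import numpy as np

def r2(X, y):
    """In-sample R^2 of an OLS fit of y on X, with an intercept added."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

def first_and_last(X, y):
    """Per-covariate 'first' (alone) and 'last' (added after all others) R^2 contributions."""
    p = X.shape[1]
    full = r2(X, y)
    first = np.array([r2(X[:, [j]], y) for j in range(p)])
    last = np.array([full - r2(np.delete(X, j, axis=1), y) for j in range(p)])
    return first, last

# Simulated covariates: b is strongly correlated with a but has no effect of its own.
rng = np.random.default_rng(1)
n = 110
a = rng.normal(size=n)
b = a + 0.3 * rng.normal(size=n)
c = rng.normal(size=n)
y = a + c + 0.5 * rng.normal(size=n)
X = np.column_stack([a, b, c])

first, last = first_and_last(X, y)
```

Here `b` looks important under "first" because it is a proxy for `a`, but contributes almost nothing under "last" once `a` is already in the model. With correlated covariates the two heuristics can therefore rank variables quite differently, which is worth keeping in mind when choosing between them.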
