I read your paper “Balance Trees Reveal Microbial Niche Differentiation” with great interest and would like to try it out with my own data.
I have 110 samples and 22 covariates (after filtering out highly correlated ones). How should I select the best subset of covariates to fit the model?
Thank you for your reply. I think you misunderstood my question. What I wanted to ask is: when multiple environmental factors are measured (in my case 22 variables), how should I select the independent variables/environmental factors/covariates to fit the regression model?
In the beginning, I fit the model with all the covariates that I thought might influence the microbiome. After I saw that many of them did not explain much of the variance, I tried fitting with one variable first, then adding the others one by one, and kept those that explain at least 2% of the variance in the data.
The covariates can be selected using the regression summaries you get when you run the full model. We use a leave-one-variable-out cross validation approach, where we evaluate the change in R^2 when each variable is left out. The variables whose removal changes R^2 the most are probably the ones you are most interested in.
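To make the leave-one-variable-out idea concrete, here is a minimal sketch using plain least squares in NumPy. It is not the package's own implementation: the helper names (`r_squared`, `leave_one_out_r2`), the covariate names, and the simulated data are all made up for illustration, and R^2 is pooled over all response columns (e.g. balances), which is an assumption.

```python
import numpy as np

def r_squared(X, Y):
    """Pooled R^2 of a least-squares fit of Y on X (an intercept is added)."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    resid = Y - design @ coef
    total = Y - Y.mean(axis=0)
    return 1.0 - (resid ** 2).sum() / (total ** 2).sum()

def leave_one_out_r2(X, Y, names):
    """Return the full-model R^2 and the drop in R^2 when each covariate is removed."""
    full = r_squared(X, Y)
    drops = {}
    for j, name in enumerate(names):
        reduced = np.delete(X, j, axis=1)  # refit without covariate j
        drops[name] = full - r_squared(reduced, Y)
    return full, drops

# Hypothetical data: 110 samples, 4 covariates, 2 response columns.
rng = np.random.default_rng(0)
names = ["pH", "temperature", "depth", "salinity"]
X = rng.normal(size=(110, len(names)))
Y = np.column_stack([
    2.0 * X[:, 0] + rng.normal(size=110),  # driven by pH
    1.5 * X[:, 1] + rng.normal(size=110),  # driven by temperature
])

full, drops = leave_one_out_r2(X, Y, names)
print(f"full model R^2 = {full:.3f}")
for name, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
    print(f"{name}: R^2 drop = {drop:.3f}")
```

Covariates with the largest drop are the ones the full model depends on most. Note that two correlated covariates can each show a small drop because either one can stand in for the other.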
That being said, the approach that you are using is completely valid. It may be worthwhile reading up on relative importance analysis (more information can be found in the R package relaimpo). Basically you are proposing the strategy that they refer to as "first", whereas we implemented the "last" strategy. Both are heuristics (this boils down to the feature selection problem, which is difficult), and there may be smarter ways to go about this.
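For comparison, the add-one-at-a-time approach with a 2% cutoff could be sketched as below. This is only one reading of that procedure: it greedily adds whichever remaining covariate raises R^2 the most and stops once the best gain falls under the threshold. The helper names (`r_squared`, `forward_select`), the covariate names, and the simulated data are hypothetical, and plain least squares in NumPy stands in for the actual model.

```python
import numpy as np

def r_squared(X, Y):
    """Pooled R^2 of a least-squares fit of Y on X (an intercept is added)."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    resid = Y - design @ coef
    total = Y - Y.mean(axis=0)
    return 1.0 - (resid ** 2).sum() / (total ** 2).sum()

def forward_select(X, Y, names, min_gain=0.02):
    """Greedily add covariates while the best one still adds at least min_gain to R^2."""
    selected = []    # column indices kept so far
    current = 0.0    # R^2 of the intercept-only model
    remaining = list(range(X.shape[1]))
    while remaining:
        gains = {j: r_squared(X[:, selected + [j]], Y) - current for j in remaining}
        best = max(gains, key=gains.get)
        if gains[best] < min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        current += gains[best]
    return [names[j] for j in selected], current

# Hypothetical data: 110 samples, 4 covariates, 2 response columns.
rng = np.random.default_rng(0)
names = ["pH", "temperature", "depth", "salinity"]
X = rng.normal(size=(110, len(names)))
Y = np.column_stack([
    2.0 * X[:, 0] + rng.normal(size=110),  # driven by pH
    1.5 * X[:, 1] + rng.normal(size=110),  # driven by temperature
])

kept, r2 = forward_select(X, Y, names)
print("kept covariates:", kept)
print(f"R^2 of selected model = {r2:.3f}")
```

The result depends on the order in which variables enter, so a covariate that looks unimportant on its own can still matter once others are in the model. That order dependence is one reason both strategies are heuristics.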
tl;dr both strategies are valid. We have an implementation that helps select covariates, but there is definitely room for improvement.