I am very new to the QIIME 2 microbiome bioinformatics platform; a friend introduced me to it a few months ago. I love QIIME 2, and thank you for developing the platform. Please forgive me if some of these questions are too basic for QIIME 2 experts.
I have a few quick questions related to gneiss:
a) For the current study, we have 20 soil parameters (chemical and physical analyses) and 50 observations in total. Is it possible to run correlation-clustering and ols-regression first, to find the soil parameters (covariates) with the highest R2 diff (and corrected p-values < 0.05), and then use gradient-clustering on those covariates (for instance, pH) that contribute to the variation in the soil microbiomes?
b) In one of the trial runs using gneiss, all the soil parameters (both chemical and physical) were used to compute the linear regression summary (through ols-regression after correlation-clustering), and an Rsquared of more than 0.600 was achieved. Unfortunately, all the pred_mse values (from fold 0 to 9) were higher than model_mse. I then ran it again with only the soil chemical parameters (16 covariates and 50 observations). The pred_mse values (ranging from 7.5 to 22) are now lower than model_mse (ranging from 22 to 26). I am still considering whether to further reduce the number of soil chemical parameters (covariates) and re-run ols-regression. Is there a minimum ratio of pred_mse to model_mse that we should achieve before running balance-taxonomy analyses? Is there a maximum number of covariates allowed for ols-regression?
It’ll help to paste an image of the diagnostic plots.
My recommended differential abundance workflow has changed significantly since the development of gneiss. See our paper here. Also see songbird and qurro.
Based on the summary, pH appeared to have the highest R2diff. Would it be possible for me to re-run with gradient-clustering on pH prior to downstream analyses (with balance-taxonomy)?
Thank you for pointing me to the latest differential abundance workflow and your most recent work. I will go through the paper, songbird, and qurro as soon as possible.
Hi @yitkheng - the R^2 suggests that this may be a good fit.
I can’t comment on the ratio; it is typically used as a guide for cross-validation (you don’t want pred_mse to be much higher than model_mse).
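To make the cross-validation comparison concrete, here is a minimal sketch in plain NumPy. It is not gneiss's implementation: the data are synthetic, the helper names `ols_mse` and `kfold_mse` are made up for illustration, and I am assuming model_mse means the error on the training folds and pred_mse the error on the held-out fold. It fits ordinary least squares with 16 covariates on 50 samples, then with a single covariate, so the two error columns can be compared fold by fold.

```python
import numpy as np

def ols_mse(X_train, y_train, X_test, y_test):
    """Fit OLS on the training rows; return (in-sample MSE, held-out MSE)."""
    # Prepend an intercept column to both design matrices.
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    model_mse = np.mean((y_train - A_train @ beta) ** 2)
    pred_mse = np.mean((y_test - A_test @ beta) ** 2)
    return model_mse, pred_mse

def kfold_mse(X, y, n_folds=10, seed=0):
    """Return an (n_folds, 2) array of [model_mse, pred_mse] per fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    rows = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        rows.append(ols_mse(X[train_mask], y[train_mask],
                            X[test_idx], y[test_idx]))
    return np.array(rows)

# Synthetic stand-in for 50 samples x 16 covariates (an assumption, not real
# soil data); only the first covariate actually drives the response.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 16))
y = 2.0 * X[:, 0] + rng.normal(size=50)

full = kfold_mse(X, y)            # all 16 covariates
reduced = kfold_mse(X[:, :1], y)  # first covariate only, same folds

for name, res in [("16 covariates", full), ("1 covariate", reduced)]:
    print(f"{name}: mean model_mse={res[:, 0].mean():.2f}, "
          f"mean pred_mse={res[:, 1].mean():.2f}")
```

Adding covariates can only lower the in-sample error on a given training set, so a small model_mse on its own says little. The held-out error is what indicates whether the extra covariates are helping or just fitting noise.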
I also want to point out that there have been a number of findings regarding differential abundance since gneiss was initially developed. I recommend reading this paper and checking out either songbird or aldex2, since they may provide a much easier way to obtain interpretable features.