Understanding Gneiss - questions regarding the analysis

Alex_14262 · February 28, 2018, 2:01pm

Hello

I have run the gneiss tutorial and have been trying to do the analysis on my own dataset, but I have some basic questions about it, as I don't fully understand it.

1. I am not sure I fully understand what question this analysis is trying to answer. I thought it was testing which of the factors of interest influence the relative abundance of species in samples. So essentially it's like running a multiple regression?
2. Following up on the previous question, what is the null hypothesis?
3. I guess the main chunk of the analysis is a multiple regression. However, there are no p-values/F-scores/degrees of freedom, so how does one know which variable to keep in the model and which to omit? Also, what test statistics would one quote in a paper?
4. How does one interpret the prediction and residual plots? I don't really understand the relevance of a scatter plot between the first two balances, or what I should expect to see in those plots. Also, what do the percentages in brackets represent?
5. Finally, is the tree presented in the analysis representative of how the taxa are phylogenetically connected?

Thanks

colinbrislawn · February 28, 2018, 6:01pm

Hello Alexandru,

You have probably found these resources already, but I wanted to post them here for future users.

Intro video to this method: https://www.youtube.com/watch?v=HAULM1WQkew
Gneiss / balance tree tutorial: Differential abundance analysis with gneiss — QIIME 2 2018.2.0 documentation
And finally, the paper itself: http://msystems.asm.org/content/2/1/e00162-16

Colin

mortonjt · February 28, 2018, 8:01pm

Hi @Alex_14262,

Yes, the OLS method in gneiss is essentially a GLM -- we're performing something very similar to multinomial regression, but using the ilr transform instead of the alr transform.

There are a few null hypotheses: a global null hypothesis and local null hypotheses. The global null hypothesis concerns whether the overall fit explains the data. For now, we have a measure of R^2 to handle this, but it would be nice to have a global F-test / p-value to evaluate the overall fit -- we just haven't gotten around to it yet.

The local hypothesis evaluates the likelihood of the coefficient of a balance being zero. If the slope is close to zero, then that balance isn't very explanatory for the particular covariate of interest.
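To make the local null hypothesis concrete, here is a minimal sketch in plain Python. It is not gneiss's own code: the data, the `ph` covariate, and the helper names `fit_line` and `slope_t_statistic` are hypothetical, and it handles only one covariate and one balance. It fits a line to a single balance and computes the usual t statistic for "slope equals zero".

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def slope_t_statistic(x, y):
    """t statistic for H0: slope == 0, with n - 2 degrees of freedom."""
    n = len(x)
    intercept, slope = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    residual_variance = sum(r ** 2 for r in residuals) / (n - 2)
    standard_error = math.sqrt(residual_variance / sxx)
    return slope / standard_error

# Hypothetical data: one covariate value and one balance value per sample.
ph = [5.0, 5.5, 6.0, 6.5, 7.0, 7.5]
balance_y0 = [-1.1, -0.7, -0.2, 0.1, 0.8, 1.0]

intercept, slope = fit_line(ph, balance_y0)
t_stat = slope_t_statistic(ph, balance_y0)
print(f"slope={slope:.3f}, t={t_stat:.2f}")
```

A slope near zero (and a small t statistic) would suggest that this balance does not change much with the covariate.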

See here on a recent discussion on variable selection.

Those scatter plots were originally designed to give a high-level overview of the overall fit using the top two balances. The percentages represent the proportion of variance explained, as described in the tutorial @colinbrislawn linked.

The tree is what you passed in during hierarchical clustering. However, phylogenetic trees can be passed in as soon as this issue is resolved.

Alex_14262 · March 1, 2018, 11:27pm

Hi,

Thanks so much for your help! I read the paper "Balance Trees Reveal Microbial Niche Differentiation" (thanks for the link @colinbrislawn) and I understand it better now, but I've got a couple more questions.

So then how would you know when to reject/accept the null hypothesis? In the tutorial, R2=0.11 is considered good enough, but in my experience as a student so far, that has been considered as a very low R2 value when running regressions/ANOVAs. Are low R2 values more acceptable when running analyses on microbial differentiation?

That being said, if the R2 diff is 0, does that mean that the model would do the same with or without that covariate? For example, I have a covariate with R2=0.15 and R2 diff=0.00, and another covariate with R2=0.13 and R2 diff=0.015 - is the second more important, despite having a smaller R2?

So ideally you want to see as much overlap as possible between red and blue? Are the raw values the ones that are calculated based on the data, and the predicted, the ones given by the model?

How does the calculation of ward metrics fit in with the rest of the analysis? From what I understood, having built the dendrogram as a result of clustering, the balances would be calculated using the proportions of each OTU across all samples, but in this case, ward metrics are used instead for the numerator/denominator?

One other question: does the dendrogram heatmap in qiime consist of the calculated balances (log-ratios) or something else? It is not clear from the legend.

Finally, the R2 values I got for cross validation are higher than the general R2 value I got (0.2 vs 0.15 respectively). Is that normal?

Many thanks

mortonjt · March 2, 2018, 3:00am

We don't currently have a standard way to reject / accept the null hypothesis for the global fit. The R^2 is the amount of variation in the community that is able to be explained by the model. It may make more sense in the context of ordination -- we don't expect to have PCoAs that fit the data anywhere close to 100%, and having a PCoA with the variance in the top 3 axes is typically acceptable.
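As a rough illustration of R^2 as "variation explained", the sketch below pools residual and total sums of squares over every balance and every sample. The pooling scheme, the function name `r_squared`, and the numbers are assumptions made for illustration; this is not taken from the gneiss source.

```python
def r_squared(observed, predicted):
    """Return 1 - SS_residual / SS_total, pooled over all balances.

    Both arguments are lists of rows (one row per sample),
    each row holding one value per balance.
    """
    n_samples = len(observed)
    n_balances = len(observed[0])
    # Mean of each balance (column) across samples.
    means = [sum(row[j] for row in observed) / n_samples
             for j in range(n_balances)]
    ss_res = sum((o - p) ** 2
                 for obs_row, pred_row in zip(observed, predicted)
                 for o, p in zip(obs_row, pred_row))
    ss_tot = sum((row[j] - means[j]) ** 2
                 for row in observed
                 for j in range(n_balances))
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and model-predicted balances: 3 samples x 2 balances.
observed = [[-1.0, 0.5], [0.0, 0.3], [1.0, -0.8]]
predicted = [[-0.8, 0.4], [0.1, 0.1], [0.7, -0.5]]

r2 = r_squared(observed, predicted)
print(f"R^2 = {r2:.3f}")
```

A perfect prediction gives 1, and predicting every balance by its mean gives 0, so a value like 0.11 means the covariates account for about 11% of the pooled variation under this definition.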

If you are getting small R2's, you may want to double check with other methods such as PERMANOVA, where you can test the global null hypothesis.

How did you get your covariates for your fit? Did you run separate models on each covariate?

Concerning the fit, it is ideal to see the predicted points (in red) centered within the raw points (in blue). This is to see if there is any obvious bias in the model fit.

The ward clustering just provides the tree for the ILR transform, which is used as a scaffold for calculating the log ratios.

Concerning the cross-validation -- that is probably not an issue. But if the predicted_mse is greater than the model_mse, then you have problems.
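The comparison between the two errors can be sketched with a small k-fold loop. This is a simplified stand-in, not the gneiss implementation: it uses one covariate, one balance, hypothetical data, and hypothetical helpers (`fit_line`, `mean_squared_error`, `cross_validate`). For each fold it reports the error on the samples used for fitting (the model MSE) and the error on the held-out samples (the predicted MSE).

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_squared_error(x, y, intercept, slope):
    """Average squared difference between y and the fitted line."""
    return sum((yi - (intercept + slope * xi)) ** 2
               for xi, yi in zip(x, y)) / len(x)

def cross_validate(x, y, n_folds):
    """Return one (model_mse, predicted_mse) pair per fold."""
    results = []
    for fold in range(n_folds):
        # Every n_folds-th sample, starting at `fold`, is held out.
        held_out = set(range(fold, len(x), n_folds))
        train_x = [x[i] for i in range(len(x)) if i not in held_out]
        train_y = [y[i] for i in range(len(x)) if i not in held_out]
        test_x = [x[i] for i in sorted(held_out)]
        test_y = [y[i] for i in sorted(held_out)]
        intercept, slope = fit_line(train_x, train_y)
        results.append((
            mean_squared_error(train_x, train_y, intercept, slope),
            mean_squared_error(test_x, test_y, intercept, slope),
        ))
    return results

# Hypothetical covariate and balance values with a little noise.
covariate = [float(i) for i in range(10)]
noise = [0.1, -0.2, 0.05, 0.0, -0.1, 0.2, -0.05, 0.1, 0.0, -0.1]
balance_values = [1.0 + 2.0 * xi + e for xi, e in zip(covariate, noise)]

folds = cross_validate(covariate, balance_values, n_folds=5)
for model_mse, predicted_mse in folds:
    print(f"model_mse={model_mse:.4f}  predicted_mse={predicted_mse:.4f}")
```

If the held-out error is consistently much larger than the fitting error, the model is likely overfitting.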

Alex_14262 · March 2, 2018, 10:21am

Thanks for making that clear!

I included all of them in the --p-formula parameter as done in the tutorial.

That's what I thought, but I got confused because in the graph below, y0=ln numerator/denominator, like in the Ward metric (from the tutorial).

I am not sure this looks okay since most of the blue dots are aligned on 0? I am currently reanalyzing a set of data that has been analyzed with qiime 1/ANOVAs.

mortonjt · March 5, 2018, 5:59pm

@Alex_14262, did we answer all of your questions? It seems that your original questions have been answered.

One more thing to note is that the axes do have units of log fold change. It looks like you have two large outliers that may need investigating. Otherwise it looks ok.

Alex_14262 · March 19, 2018, 11:55am

Hi

I came across two more things:

1. Are the balances plotted in the heatmap the observed ones or the predicted ones?

2. Why are not all the levels of a variable included in the regression .qzv of gneiss? Some levels are left out, with no mse/R2 value/R2 diff provided, and I am not sure how to interpret this. Are they not significant?

mortonjt · March 19, 2018, 2:32pm

Neither - those are the regression coefficients. If you have continuous variables, it's the slope. If you have categorical variables, it's the difference between categories. This can also be thought of as effect size.
A similar question has been posted here.
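A small sketch of why a categorical coefficient reads as a difference between categories. It assumes treatment (dummy) coding, where one level serves as the reference and is coded 0; that coding, the group labels, and the data are assumptions for illustration, not output from gneiss.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical two-level variable and one balance value per sample.
groups = ["control", "control", "control", "treated", "treated", "treated"]
balance_values = [0.1, -0.2, 0.1, 1.0, 1.3, 0.7]

# Dummy coding: the reference level ("control") becomes 0, "treated" becomes 1.
indicator = [1.0 if g == "treated" else 0.0 for g in groups]
intercept, coefficient = fit_line(indicator, balance_values)

control = [b for g, b in zip(groups, balance_values) if g == "control"]
treated = [b for g, b in zip(groups, balance_values) if g == "treated"]
control_mean = sum(control) / len(control)
treated_mean = sum(treated) / len(treated)

print(f"intercept={intercept:.3f}  (control mean={control_mean:.3f})")
print(f"coefficient={coefficient:.3f}  (difference={treated_mean - control_mean:.3f})")
```

Under this coding the reference level gets no coefficient of its own: it is absorbed into the intercept, and each remaining level is reported relative to it. That may be why one level of a variable does not appear in a regression summary.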

Alex_14262 · March 19, 2018, 3:14pm

Sorry I meant this type of heatmap, not the one in the regression .qzv:

mortonjt · March 19, 2018, 9:14pm

Oh XD

Those represent the partitions used to calculate balances, which are used to calculate both the predicted and observed balances. Does this address your question?

Alex_14262 · April 1, 2018, 6:52pm

Hi,

Uhm, not sure I understand. I get what the red and pink partitions mean. But I am not sure what is represented on the blue background. Is that only balance "y0"? And is it the observed or predicted one? (I assume the observed balances are the ones calculated with the ilr-transform, whereas the predicted ones are predicted by the regression?)

Despite having done the tutorial and read your paper on gneiss, I don't really know what to get out of this heatmap. For instance, I understand that a change in colour = higher/lower balances, which in turn means more or less of the denominator/numerator taxa. But how is that related to the position on the y axis? If we are looking at y0, I thought you would have one value per sample?

mortonjt · April 4, 2018, 5:23am

The heatmap just shows counts across features and samples. The blue background represents zero counts, whereas a red color indicates higher counts for that particular taxon in a specific sample. No balance transformations are displayed in the plot -- it is just plotting the raw taxon abundances.

The main purpose of this plot is to help visualize how balances are computed from the table. This is mainly relevant with the tree on the right. The bright red bars indicate taxa that belong to the denominator of the specific balance, whereas the lighter red bars indicate taxa that belong to the numerator of the specific balance.

Every balance is calculated as follows

$$y_i = A \ln \frac{g(x_{\text{light-red}})}{g(x_{\text{bright-red}})}$$

where g(x) is the geometric mean, x are the taxon abundances across samples and A is a normalization constant.
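For a single sample, the calculation can be sketched as follows. The post does not spell out the normalization constant; the sketch assumes the standard ilr choice A = sqrt(r*s / (r + s)), where r and s are the number of taxa in the numerator and denominator. The function names and counts are hypothetical, and zero counts would need a pseudocount before taking logs.

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def compute_balance(numerator, denominator):
    """Log-ratio of geometric means, scaled by the ilr constant.

    `numerator` and `denominator` hold one sample's abundances for the
    taxa on each side of the tree split. The constant sqrt(r*s/(r+s))
    is an assumption here (the standard ilr normalization).
    """
    r, s = len(numerator), len(denominator)
    normalization = math.sqrt(r * s / (r + s))
    ratio = geometric_mean(numerator) / geometric_mean(denominator)
    return normalization * math.log(ratio)

# Hypothetical counts for one sample: 2 numerator taxa, 3 denominator taxa.
numerator_counts = [10.0, 40.0]
denominator_counts = [5.0, 5.0, 5.0]

y0 = compute_balance(numerator_counts, denominator_counts)
print(f"y0 = {y0:.4f}")
```

A positive value means the numerator taxa are, on average, more abundant than the denominator taxa in that sample. A value of zero means the two groups have equal geometric means.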

So for the purposes of this visualization, the blocks on the left are only there to help identify from a high-level perspective which taxa belong to which balances. The actual log-ratios aren't being displayed here.

Alex_14262 · April 4, 2018, 9:06am

Amazing, thanks for the thorough explanation!