Gneiss usage and interpreting output

jmb · January 5, 2018, 6:19pm

I am using the gneiss plugin and have a few questions about its usage and output:

Is it possible to have nested factors in the model? For example, if you primarily want to compare balances in two different species but you have multiple samples from different individuals (e.g. four individuals for each species with each individual represented by samples from cecum, large intestine, and fecal), can you nest individuals within species to account for the statistical dependency of the samples from a single individual? If you can't do this would a model that includes the factor species still produce unbiased coefficients, but potentially biased p-values?
I have some questions about interpreting the output of balance-taxonomy. The first two questions are based on the attached file: y0_taxa_summary_lower_taxa4.qzv

(I) As seen in this file, sometimes (but not always), the colors get switched between the top log ratio plot and
the proportion plot (i.e. the species represented by the green box in the top plot is represented by orange in the
proportion plot). Is there a way to control this?

(II) What determines which taxa show up in the proportion plot? In this case, 10 taxa from the the numerator
and 10 from the denominator are shown in the plot. There are obviously hundreds of taxa not shown. Are the
taxa shown just a random selection, or are they something like the top 10 most influential?

The last question is based on the attached file: y3_taxa_summary_Upper_taxa6.qzv

(III) Why does one bar span both the numerator and denominator sections of the plot? Is there a way to change
this?

Thanks, I really appreciate any input!

jmb · January 5, 2018, 6:21pm

Sorry, I forgot to attach the files!y0_taxa_summary_lower_taxa4.qzv (137.2 KB)
y3_taxa_summary_Upper_taxa6.qzv (90.4 KB)

mortonjt · January 5, 2018, 8:10pm

Very good question.

I think nested factors are actually enabled (although I never had a test case where I explicitly tested this). The backend engine that parses statistical formulas is patsy. According to the documentation here it looks like it can handle this - which is super cool! Give it a shot, I'd be curious to see if it works.

2 (I). I reported the issue here, not sure why the colors are inconsistent, but it could be due to how seaborn handles their default colors. The current workaround is to use Gneiss Python API to control the specific colors that you want. There is a tutorial here on how to use this sort of functionality.
2 (II). We take the top ten taxa with the largest positive log fold change and the top ten taxa with the largest negative log fold change.
2 (III) That is weird. Raised issue here

jmb · January 5, 2018, 10:34pm

Gniess_lower_summary_nested.qzv (1.8 MB)

Thanks again!

jmb · January 5, 2018, 10:36pm

Sorry I don't know why but most of the text from my last reply doesn't look like it showing up. Here it is:

Thank you very much for the prompt response! I looked at the patsy documentation and to nest individual within species I set up the formula as "Species/Individual" and it appears to have worked. I am attaching the output in case you are interested in looking at it, and in cased I am misinterpreting whether it actually worked. There is a large increase in the R-squared (from 0.2661 without nesting to 0.894 with nesting) and the p-values for the balances have changed as well. So unless I'm missing something (which is certainly possible!), it looks like this type of analysis can be handled. Also, in case it helps diagnose the issue, one more point about the last question regarding the bar spanning the numerator and denominator--some version of this seems to happen whenever there are fewer than ten taxa in the numerator and/or denominator.

jmb · January 10, 2018, 8:11pm

I just wanted to follow-up on the previous posts in case other users are interested in whether Gneiss can handle nested analyses. Although at first I thought the analysis I described above worked, the more I look at the result the less confident I am. One thing I guess I don't understand is the MSE values that are printed for the different factors. First, I don't really understand why there is a separate MSE calculated for each individual (and p-values reported on the balances) as opposed to simply having a value for the factor "individual nested within species". In addition, MSE is calculated for factor combinations that are not possible given the nested structure of the dataset (i.e. individuals appear to be crossed with species instead of nested within species). Last, I also ran the analysis using only "individual" as a factor and the R-squared value is exactly the same as the nested analysis, which seems strange to me. I would be interested to know whether anybody has any input on how to further troubleshoot this to make sure that it is working correctly. As a work around for now, I have just collapsed my table by individual and carried out the analysis comparing the species. I think this should be ok, but it would be great to know if Gneiss can handle nested analyses more directly since I'm sure a number of people would be working with such datasets. Thanks, I appreciate any feedback!

mortonjt · January 10, 2018, 8:36pm

Hi @jmb. I'm looking at your nested analyses results a little more carefully. One red flag that I'm noticing is that you don't have that many samples (24 samples is quite small). At least, this is small with respect to the number of covariates ( which are 16 covariates). It's possible that the reason why the R^2 is so high is because there are almost as many covariates as points, and you are overfitting close to a perfect fit. The rule of thumb in regression analysis is to have the number of covariates < 10% of the samples. So if you have 100 samples, you should have 10 covariates.

So I'm not sure if you could even do nested analyses in this particular case. I'd only stick with trying to test only 1 covariate, because your dataset too small to allow for more complex analyses.

Concerning the MSE question, the MSE in the top table is only used to validate the effect size of a given covariate. This is done by removing the covariate from the model, fitting the reduced model on the dataset, and seeing how different the reduced model error is compared to the full model error.

jmb · January 15, 2018, 6:00pm

Hi,

Thanks for getting back to me. What was really confusing me is why there would be so many covariates in the model in the first place, given the nested design. Moreover, some of the covariates are for factor combinations that do not exist (as an example if you look at individual A2, it is being crossed with species when it should be nested within only one of the species). I've done a little more research and tinkering and I think the problem may have been with how individuals were coded within the metadata file. In my case I had every individual with a unique id. If, however, I recycle ids across the species (i.e. species A has individuals 1-5 and species B has individuals 1-5) it seems to work more as expected (see attached file: Gniess_lower_summary_nested2_indivrecode.qzv). But, this raises another issue, which is whether it would be treating the nested factor as a random effect (as it should)? If it does, great, but I'm thinking probably not since I see you have a separate command for running mixed models. Is this what is being run behind the scenes when you run the lme regression?: statsmodels.regression.mixed_linear_model.MixedLM.from_formula - statsmodels 0.15.0 (+594)

If so, it looks like it should be possible to run a model with nested random effects, but it would require being able to calculate and provide the variance components (vc_formula) and variance structure (re_formula). The first example at the bottom of the page referenced above would be representative of the analysis I am trying to accomplish. In my case, "species" would be equivalent to "school (the grouping factor)" and "individual" would be equivalent to "classroom." Would something like this be possible with the Gneiss plugin? From what I can tell you can pass a grouping factor, but I don't see options to set the variance components or structure. Please let me know if you think there is a way to do this. I am assuming that this information might be helpful to others as well since I would expect nested random effects models would be pretty common, but if this is becoming too specific to my situation to be useful for others let me know--I know you are super busy!! Thanks, I appreciate they great support on this forum.Gniess_lower_summary_nested2_indivrecode.qzv (1.4 MB)

mortonjt · January 16, 2018, 6:01am

Hi @jmb, currently the option to specify variance components within linear mixed effects models does not exist (mainly because we haven't had a use case for it before).

But you can definitely get started on this without the official gneiss commands. It'll boil down to unpacking the balances, and running the MixedLM model on each balance (this is essentially what is going on within Gneiss underneath the hood). This will allow you to specify the vcf_formula for each equation.

jmb · January 17, 2018, 9:25pm

Thanks for the suggestion! For anyone interested I ended up just exporting the balances as suggested and analyzing the data using lme4/lmeTest packages in R and everything worked well.

system · February 18, 2018, 3:35am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.