Need help understanding the outputs from GNEISS

JoaoGabrielMoraes · May 19, 2020, 3:32am

Hello,

I have 16S sequencing data from uterine, placental, and fetal tissues from cows and heifers. To check for possible differences in the abundance of specific OTUs among the tissues collected, as well as differences between cows and heifers (within the same tissue type), I used ANCOM to perform pairwise comparisons. I observed very few differences using ANCOM, so I decided to use GNEISS to look for possible differences in taxa in my dataset using a multivariate model.

However, I am having trouble understanding the output generated.

• In the regression summary file (attached) I concluded that:

1- The variables that impacted the model the most were Site[external] and Animal Type. This makes sense to me, since the external sample (which explained about 6% of the community variation) was a control sample collected from the outside of the reproductive tract to control for handling contamination. Thus, we expected a more diverse microbiome on Site[external].
In addition, we also expected to see differences in the microbiome of cows and heifers. Thus, AnimalType[Heifer] being the second most significant variable in our model, explaining about 3.6% of the community variation, makes complete sense to me.
2- Overall, our regression model can explain about 22% of the community variation. I also found this result reasonable compared to the results from the data presented in the tutorial.
3- In our model, the prediction error (pred_mse) is also less than the within-model error (model_mse), suggesting that overfitting is not happening.
4- Am I missing any important conclusions from this section?

I had trouble, however, understanding the remaining plots:

• Heatmap:

1- I understood that the heatmap is plotting the coefficient p-values from the regression model for each of the balances (or ratios) for each OTU identified across sites.
2- Thus, it looks like the first three sample locations (Allantoic Fluid, Amniotic Fluid and External) have greater differences in abundances of OTUs. Is that correct?
3- Is the dendrogram showing the taxa that clustered together because they had similar ratios across sites, using something such as Euclidean distance?
4- I didn’t understand the section of the plot between the dendrogram and the heatmap, Y0 through Y9. Are these the top 10 balances? What exactly is Y? Why are there a total of 10 Y plotted, from 0 through 9? More importantly, what are the numerator and the denominator?
5- Why is the width of some of the blue columns in the heatmap smaller than others? Is it a result of some sites having a lower OTU diversity?

• Prediction and residual plots
1- From the prediction and residual plots my only conclusion is that we might have 4 outliers.
2- What else can you conclude based on these plots?

• Explained Sum of Squares
1- Unfortunately I have no idea of what is going on here. In the tutorial, by looking at this plot, it was concluded that the balance of y0 was important. Can you please shed some light on that?
2- Also, it was concluded that “The balances not only have very small p-values (with p<0.05) for differentiating subjects, but they also have the largest branch lengths in the tree diagram. This suggests that this partition of microbes could differentiate the CFS patients from the controls.” Can you please explain how those conclusions were drawn from the plot? Due to my lack of understanding of the plot, I only see a meaningless branched tree.

I would greatly appreciate any help understanding this output.

Thank you so much!!

balances.qza (772.4 KB) heatmap.qzv (222.9 KB) hierarchy.qza (41.4 KB) regression_summary.qzv (2.6 MB)

mortonjt · May 20, 2020, 5:00pm

Yes, those conclusions are correct - 22% explained variance is certainly higher than in the average study.
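To make the quantities in the regression summary more concrete, here is a minimal sketch of how an overall R^2 and the two cross-validation errors could be computed for a linear model fitted to balances. It is not gneiss's own code: the toy data, the fold count, and the function names (`fit_ols`, `r_squared`, `cross_validate`) are all assumptions made for illustration. The idea is only that a pred_mse far above model_mse would point to overfitting, while comparable values do not.

```python
import numpy as np

def fit_ols(X, Y):
    """Least-squares coefficients for Y ~ X (X already contains an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta

def r_squared(X, Y):
    """Fraction of the total variation across all balances explained by the model."""
    resid = Y - X @ fit_ols(X, Y)
    total = Y - Y.mean(axis=0)
    return 1.0 - (resid ** 2).sum() / (total ** 2).sum()

def cross_validate(X, Y, n_folds=5):
    """Return (model_mse, pred_mse) averaged over the folds.

    model_mse is the error on the samples used for fitting,
    pred_mse is the error on the held-out samples.
    """
    n = X.shape[0]
    folds = np.array_split(np.arange(n), n_folds)
    model_mse, pred_mse = [], []
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        beta = fit_ols(X[train], Y[train])
        model_mse.append(np.mean((Y[train] - X[train] @ beta) ** 2))
        pred_mse.append(np.mean((Y[~train] - X[~train] @ beta) ** 2))
    return float(np.mean(model_mse)), float(np.mean(pred_mse))

# Hypothetical toy data: 60 samples, 8 balances, one binary covariate
# (for example cow = 0, heifer = 1). The samples are already in random order.
rng = np.random.default_rng(0)
n_samples, n_balances = 60, 8
covariate = rng.integers(0, 2, size=n_samples)
X = np.column_stack([np.ones(n_samples), covariate])
true_beta = rng.normal(size=(2, n_balances))
Y = X @ true_beta + rng.normal(scale=2.0, size=(n_samples, n_balances))

r2 = r_squared(X, Y)
model_mse, pred_mse = cross_validate(X, Y)
print(f"R^2 = {r2:.3f}, model_mse = {model_mse:.3f}, pred_mse = {pred_mse:.3f}")
```

Dropping one covariate from `X` and recomputing `r_squared` gives the kind of R^2 difference that indicates how much that covariate contributes.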

I'd avoid drawing that conclusion from the heatmap - the R^2 differences are a better basis for that sort of inference.

Yes, those are the top 10 balances. To find the numerator / denominator, run the balance-taxonomy command.
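For context, a balance (y0, y1, ...) is a scaled log ratio between the geometric means of two groups of taxa, the numerator and the denominator, at one internal node of the tree. The sketch below computes the standard isometric log-ratio balance for one node. The column indices, the pseudocount of 1, and the function name `balance` are assumptions for illustration, not values taken from this dataset or from gneiss itself.

```python
import numpy as np

def balance(counts, numerator, denominator, pseudocount=1.0):
    """Balance of one tree node for every sample.

    counts      : samples x taxa array of raw counts
    numerator   : column indices of the taxa on one side of the node
    denominator : column indices of the taxa on the other side
    """
    x = np.asarray(counts, dtype=float) + pseudocount  # avoid log(0)
    num = x[:, numerator]
    den = x[:, denominator]
    r, s = num.shape[1], den.shape[1]
    # The log of a geometric mean is the mean of the logs
    log_gmean_num = np.log(num).mean(axis=1)
    log_gmean_den = np.log(den).mean(axis=1)
    return np.sqrt(r * s / (r + s)) * (log_gmean_num - log_gmean_den)

# Hypothetical table: 2 samples x 4 taxa
toy_counts = [[9, 9, 0, 0],
              [0, 0, 9, 9]]
b = balance(toy_counts, numerator=[0, 1], denominator=[2, 3])
print(b)  # positive where the numerator taxa dominate, negative otherwise
```

A positive balance means the numerator taxa are, on average, more abundant than the denominator taxa in that sample, and a negative value means the opposite.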

It's because those sites have fewer samples.

Possibly, I'd run beta-diversity to confirm that, since those plots only show the top 2 balances.

I'd ignore the scatterplots - those were designed as a diagnostic tool for the top balances to see if it is a good fit or not.

The tree branch lengths are scaled by explained variance, designed to help identify useful balances to pass into balance-taxonomy (you can zoom in and highlight the nodes of interest).
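As a rough illustration of what "explained variance per balance" means, the sketch below fits the same linear model to every balance and reports the explained sum of squares for each one, then ranks the balances. This is a simplified stand-in and not the gneiss implementation; the toy design matrix, the balance names, and the function name are hypothetical.

```python
import numpy as np

def explained_sum_of_squares(X, Y):
    """Explained sum of squares of an OLS fit, one value per balance (column of Y)."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    fitted = X @ beta
    return ((fitted - Y.mean(axis=0)) ** 2).sum(axis=0)

# Hypothetical design: intercept plus one binary group (3 samples per group)
group = np.array([0, 0, 0, 1, 1, 1])
X_toy = np.column_stack([np.ones(6), group])

# Hypothetical balances: y0 differs strongly between groups, y1 not at all, y2 slightly
names = ["y0", "y1", "y2"]
Y_toy = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 2.0, 1.0],
    [0.0, 3.0, 2.0],
    [4.0, 1.0, 1.0],
    [4.0, 2.0, 2.0],
    [4.0, 3.0, 3.0],
])

ess = explained_sum_of_squares(X_toy, Y_toy)
ranking = [names[i] for i in np.argsort(ess)[::-1]]
print(dict(zip(names, ess.round(3))), ranking)
```

Balances at the top of such a ranking are the ones worth passing to balance-taxonomy first.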

We know that the gneiss visualization is highly unintuitive, which is part of the reason why we are deprecating the statistical methods in gneiss in favor of aldex2, songbird and ancom.

If you are interested in phylogenetic visualization, I recommend checking out empress.

JoaoGabrielMoraes · May 25, 2020, 7:45pm

Thanks a lot for all your answers - @mortonjt!!

I have run ANCOM on our data to perform pairwise comparisons. I was running gneiss because it allowed me to build a multivariate regression model. Because songbird seems to allow the same type of analysis, I'll try it next.

Regarding the beta-diversity, I calculated the Bray–Curtis dissimilarity across samples using QIIME 2. However, the results are not very intuitive. Any suggestions on how to plot these data so I can understand them better?
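For readers with the same question: a common way to look at a Bray-Curtis matrix is an ordination such as principal coordinates analysis (PCoA), where each sample becomes a point and samples with similar communities land close together. Within QIIME 2 this is typically done with qiime diversity pcoa followed by qiime emperor plot. The sketch below shows the underlying idea in plain NumPy; the toy counts and function names are assumptions for illustration only.

```python
import numpy as np

def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between samples (rows of counts)."""
    x = np.asarray(counts, dtype=float)
    n = x.shape[0]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = np.abs(x[i] - x[j]).sum() / (x[i] + x[j]).sum()
    return d

def pcoa(dist, n_axes=2):
    """Classical PCoA: coordinates of each sample on the leading axes."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering  # double-centred matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_axes]          # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Negative eigenvalues can occur with non-Euclidean distances; clip them to 0
    return eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))

# Hypothetical table: samples 0-1 share one community type, samples 2-3 another
toy_table = [[10, 0, 0],
             [8, 2, 0],
             [0, 0, 10],
             [0, 2, 8]]
dist = bray_curtis(toy_table)
coords = pcoa(dist)
print(dist.round(2))
print(coords.round(3))
```

Plotting the two columns of `coords` as a scatter plot, colored by site or animal type, shows whether samples group by those variables.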

Thank you for recommending empress for the phylogenetic visualization. I'll try it as well.

For running empress, how do I generate the --i-tree and --m-feature-metadata-file input files?

qiime empress plot \
    --i-tree rooted-tree.qza \
    --i-feature-table table.qza \
    --m-sample-metadata-file sample_metadata.tsv \
    --m-feature-metadata-file taxonomy.qza \
    --o-visualization empress-tree.qzv

mortonjt · May 26, 2020, 4:31pm

Great. Feel free to open up new issues if you have any questions for songbird.

Note that empress is still in the prototyping stage, and I'm not sure it is fully operational. @fedarko would have a better idea about this.

Regarding how to create the tree, see the moving pictures tutorial:
https://docs.qiime2.org/2020.2/tutorials/moving-pictures/

Taxonomies and differentials can both be directly passed into the --m-feature-metadata-file option.

fedarko · May 26, 2020, 9:16pm

Empress is in active development, so we're still working on adding features and polishing things up. For example, feature metadata coloring isn't integrated in the main branch yet, but hopefully will be very soon. However, if you're just interested in taking a look at your phylogenetic tree (in the context of things like sample metadata), feel free to give it a shot. You may also want to check out other tools like iTOL, which can also accept QZA files for trees.