I have 16S sequencing data from uterine, placental, and fetal tissues of cows and heifers. To check for possible differences in the abundance of specific OTUs among the tissues collected, as well as differences between cows and heifers (within the same tissue type), I used ANCOM to perform pairwise comparisons. I observed very few differences using ANCOM, so I decided to use GNEISS to look for possible differences in taxa in my dataset using a multivariate model.
However, I am having trouble understanding the output generated.
• From the regression summary file (attached), I concluded that:
1- The variables that impacted the model the most were Site[external] and Animal Type. This makes sense to me, since the external sample (which explained about 6% of the community variation) was a control sample collected from the outside of the reproductive tract to control for handling contamination. Thus, we expected a more diverse microbiome in Site[external].
In addition, we also expected to see differences between the microbiomes of cows and heifers. Thus, AnimalType[Heifer] being the second most significant variable in our model, explaining about 3.6% of the community variation, makes complete sense to me.
2- Overall, our regression model can explain about 22% of the community variation. This also seems reasonable to me when compared with the results presented in the tutorial.
3- In our model, the prediction error (pred_mse) is also lower than the within-model error (model_mse), suggesting that overfitting is not happening.
4- Am I missing any important conclusions from this section?
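To make points 2 and 3 concrete, here is a minimal sketch of how I understand these summary statistics. It is not the Gneiss implementation: the data are randomly generated, there is a single hypothetical balance as the response and a single covariate, and the names `fit_line`, `mse`, and `results` are my own. The R² is the fraction of variation explained, and for each cross-validation fold the error on the training samples (what I take model_mse to be) is compared with the error on the held-out samples (what I take pred_mse to be).

```python
import random

random.seed(0)

# Hypothetical toy data: one balance (y) regressed on one covariate (x).
n = 40
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((xi - mx) ** 2 for xi in xs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mse(xs, ys, a, b):
    """Mean squared error of the fitted line on the given samples."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys)) / len(xs)


# R squared on the full data: 1 - residual SS / total SS.
a, b = fit_line(x, y)
ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
mean_y = sum(y) / n
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot
print(f"R squared: {r_squared:.3f}")

# 4-fold cross-validation: fit on the training folds, evaluate on the held-out fold.
k = 4
indices = list(range(n))
random.shuffle(indices)
results = []
for fold in range(k):
    test_idx = set(indices[fold::k])
    train_x = [x[i] for i in range(n) if i not in test_idx]
    train_y = [y[i] for i in range(n) if i not in test_idx]
    test_x = [x[i] for i in test_idx]
    test_y = [y[i] for i in test_idx]
    a_f, b_f = fit_line(train_x, train_y)
    model_mse = mse(train_x, train_y, a_f, b_f)  # error on samples used for fitting
    pred_mse = mse(test_x, test_y, a_f, b_f)     # error on held-out samples
    results.append((model_mse, pred_mse))
    print(f"fold {fold}: model_mse={model_mse:.3f}, pred_mse={pred_mse:.3f}")
```

If this reading is right, a pred_mse that is much larger than model_mse would be the sign of overfitting, which is why I took our result as reassuring.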
I had trouble, however, understanding the remaining plots:
1- I understood that the heatmap is plotting the coefficient p-values from the regression model for each of the balances (or ratios) for each OTU identified across sites.
2- Thus, it looks like the first three sample locations (Allantoic Fluid, Amniotic Fluid and External) have greater differences in the abundances of OTUs. Is that correct?
3- Is the dendrogram showing the taxa that clustered together because they had similar ratios across sites, using something such as Euclidean distance?
4- I didn’t understand the section of the plot between the dendrogram and the heatmap (Y0 through Y9). Are these the top 10 balances? What exactly is Y? Why are there a total of 10 Ys plotted, from 0 through 9? More importantly, what are the numerator and the denominator of each balance?
5- Why is the width of some of the blue columns in the heatmap smaller than others? Is it a result of some sites having a lower OTU diversity?
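To make my question about the numerator and the denominator concrete, here is a minimal sketch of how I understand a single balance to be computed. It uses the standard isometric log-ratio definition, b = sqrt(r*s / (r + s)) * ln(g(numerator) / g(denominator)), where g is the geometric mean and r and s are the number of taxa on each side of the split. The function name `balance` and the abundances are hypothetical, not values from my data, and I am assuming zeros have already been replaced by a pseudocount.

```python
import math


def balance(numerator, denominator):
    """Isometric log-ratio balance between two groups of taxon abundances.

    Both arguments are lists of strictly positive abundances for one sample.
    """
    r, s = len(numerator), len(denominator)
    # The log of a geometric mean is the mean of the logs.
    log_gm_num = sum(math.log(v) for v in numerator) / r
    log_gm_den = sum(math.log(v) for v in denominator) / s
    return math.sqrt(r * s / (r + s)) * (log_gm_num - log_gm_den)


# Hypothetical abundances for one sample: three taxa on the numerator side
# of a tree node and two taxa on the denominator side.
num_taxa = [30, 12, 8]
den_taxa = [5, 2]
print(balance(num_taxa, den_taxa))
```

Under this definition, a positive value means the numerator taxa are, on average, more abundant than the denominator taxa in that sample, and swapping the two groups only flips the sign. What I cannot tell from the plot is which taxa fall on each side for Y0 through Y9.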
• Prediction and residual plots
1- From the prediction and residual plots my only conclusion is that we might have 4 outliers.
2- What else can you conclude based on these plots?
• Explained Sum of Squares
1- Unfortunately, I have no idea what is going on here. In the tutorial, it was concluded from this plot that the balance y0 was important. Can you please shed some light on that?
2- Also, it was concluded that “The balances not only have very small p-values (with p<0.05) for differentiating subjects, but they also have the largest branch lengths in the tree diagram. This suggests that this partition of microbes could differentiate the CFS patients from the controls.” Can you please explain how those conclusions were drawn from the plot? Due to my lack of understanding of the plot, I only see a meaningless branched tree.
I would greatly appreciate any help understanding this output.
Thank you so much!!