# Statistics on alpha diversity

Hello to all the statisticians out there!

I am struggling with the significance tests for my study. I sampled 24 orchards (5 samples per orchard), of which 12 are without treatment and 12 are with treatment. I want to know if the treatment has an impact on alpha diversity. I tested the significance of my categorical variables with the Kruskal-Wallis test, but I don't understand what to use for the numeric variables. So,

1. How do I test the significance (or rather the correlation) of alpha diversity against a numeric variable, e.g. nitrogen content in the soil?

At first I thought I could use `qiime diversity alpha-correlation` (with `--p-method pearson` if the diversity index is normally distributed and `spearman` if it is not). By the way, do both variables have to be normally distributed to use Pearson?
But then I ran Pearson/Spearman in R and it gave me only the correlation coefficient. So is the "Test statistic value" in qiime the same as the Spearman correlation coefficient? And what does the p-value in qiime mean then? I also tried the Wilcoxon rank-sum test in R, but the resulting p-value is completely different from the Spearman test in qiime, and Wilcoxon actually tests the alpha diversity index against categorical data. Sorry, I did not find anything in this forum about alpha diversity and numeric variables that could help me.

Is there maybe something like adonis for alpha diversity outside of qiime2 (as far as I read in this forum there is nothing in qiime)?
2. Is it right that I can't use alpha diversity longitudinal because it is only for time series or paired samples?

• I guess I can't use ANOVA because nothing is normally distributed...
Sorry, I am mixing up all of these tests. I hope someone can help me...

Hi,

Based on your description, your experiment is a nested design, i.e., samples are nested within orchards and orchards are nested within experimental treatments. Therefore, statistical models that assume samples are independent, such as ANOVA and the Kruskal-Wallis test, shouldn't be used for modeling your data. You need to model your data using linear mixed effects models, treating experimental treatment and nitrogen content as fixed effects and orchard as a random effect. Linear mixed effects models let you do significance testing for both categorical and numerical variables.
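To make the non-independence concrete, here is a minimal sketch in plain Python. It is not from this thread, and the data are simulated, hypothetical Shannon values. It estimates the intraclass correlation (ICC) for a balanced design with the one-way ANOVA estimator. An ICC well above zero means samples from the same orchard resemble each other, which is what the random effect for orchard is meant to absorb. The function name `intraclass_correlation` and all numbers are assumptions for illustration only.

```python
import random
import statistics

random.seed(1)

# Hypothetical data: 24 orchards x 5 samples of a Shannon index.
# Each orchard gets its own offset, so samples within an orchard are similar.
orchards = {}
for orchard_id in range(24):
    orchard_offset = random.gauss(0, 0.5)
    orchards[orchard_id] = [
        3.0 + orchard_offset + random.gauss(0, 0.2) for _ in range(5)
    ]


def intraclass_correlation(groups):
    """One-way ANOVA estimate of the ICC; assumes equal group sizes."""
    k = len(groups[0])          # samples per orchard
    n_groups = len(groups)      # number of orchards
    grand_mean = statistics.fmean(v for g in groups for v in g)
    # Mean square between orchards
    ms_between = k * sum(
        (statistics.fmean(g) - grand_mean) ** 2 for g in groups
    ) / (n_groups - 1)
    # Mean square within orchards (pooled variance, balanced case)
    ms_within = sum(statistics.variance(g) for g in groups) / n_groups
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


icc = intraclass_correlation(list(orchards.values()))
print(f"Estimated ICC: {icc:.2f}")
```

This only diagnoses the clustering; it does not replace the mixed model. In R, a model of the kind described above is commonly written with lme4 as `lmer(shannon ~ treatment + nitrogen + (1 | orchard))`, where the column names are again assumptions.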


No. It's the model residuals that should be normally distributed, not the variables themselves. See the paper by Ernst and Albers, 2017, on the misconceptions about the assumptions behind the standard linear regression model.

Yes, q2-longitudinal was specifically designed for dealing with longitudinal data.

The variables do not need to be normally distributed. It's the residuals that should be normally distributed, which is actually not an important assumption. See the paper by Ernst and Albers, 2017.
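On the related question of what a Spearman test statistic and its p-value mean, here is a minimal sketch in plain Python (not the qiime implementation; the helper names and the 12 orchard-level values are hypothetical). Spearman's rho is the Pearson correlation computed on ranks, and the p-value is the probability of seeing a correlation at least this strong if there were no association. The sketch approximates that probability by shuffling one variable, which needs no normality assumption. It assumes one independent value per orchard, so it should not be fed the clustered per-sample data directly.

```python
import random
import statistics


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_permutation(x, y, n_permutations=5000, seed=0):
    """Spearman's rho plus a two-sided permutation p-value."""
    rx, ry = ranks(x), ranks(y)
    rho = pearson(rx, ry)
    rng = random.Random(seed)
    shuffled = ry[:]
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(rx, shuffled)) >= abs(rho) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero
    return rho, (extreme + 1) / (n_permutations + 1)


# Hypothetical orchard-level values (one per orchard)
nitrogen = [0.12, 0.18, 0.15, 0.22, 0.09, 0.25, 0.14, 0.20, 0.11, 0.17, 0.23, 0.13]
shannon = [3.1, 3.4, 3.0, 3.9, 2.8, 3.7, 3.3, 3.5, 2.9, 3.2, 4.0, 3.1]

rho, p_value = spearman_permutation(nitrogen, shannon)
print(f"Spearman rho = {rho:.3f}, permutation p = {p_value:.4f}")
```

A Wilcoxon rank-sum test answers a different question (do two groups differ in location?), which is why its p-value need not resemble the one from a correlation test.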


First,
thank you very much for this detailed response!!!

I will go through it carefully!
Sorry for my incorrect wording, it's the distribution of the residuals! My ANOVA looked like this, B is without treatment, C is with treatment:

So there is a lot of spread in the residuals, as I concluded?

Okay, so I cannot treat alpha diversity as my independent variable and B/C (treatment, no treatment) as the dependent group in Kruskal-Wallis?

*Sorry, I forgot to say: the other variable is the crop type, as I had two different crops and wanted to see if that would affect alpha diversity.

The spread of residuals is not that different among the experimental groups. If you're concerned about heteroskedasticity, you can run a heteroskedasticity-robust F test or use robust standard errors (e.g., HC4) in R.
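For readers curious what an HC4 standard error does, here is a minimal sketch in plain Python for the simplest case: a regression of alpha diversity on a 0/1 treatment dummy. It is an illustration under assumptions, not the routine an R package would run. The function name `hc4_two_group` and the orchard-level values are hypothetical, and the HC4 exponent follows the usual definition `min(4, h_i / mean(h))`. Robust standard errors address unequal variances only; they do not fix clustering, so the example uses one value per orchard.

```python
import statistics


def hc4_two_group(control, treated):
    """OLS slope for a 0/1 treatment dummy and its HC4 standard error."""
    y = list(control) + list(treated)
    x = [0.0] * len(control) + [1.0] * len(treated)
    n, p = len(y), 2  # p = number of coefficients (intercept, slope)

    # X'X for design rows (1, x_i), inverted by hand (2x2)
    sx, sxx = sum(x), sum(v * v for v in x)
    det = n * sxx - sx * sx
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]

    # OLS coefficients: (X'X)^-1 X'y
    sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
    b0 = inv[0][0] * sy + inv[0][1] * sxy
    b1 = inv[1][0] * sy + inv[1][1] * sxy

    # "Meat" of the sandwich: sum of w_i * x_i x_i'
    meat = [[0.0, 0.0], [0.0, 0.0]]
    for xi, yi in zip(x, y):
        row = (1.0, xi)
        # Leverage h_i = x_i' (X'X)^-1 x_i
        h = sum(row[a] * inv[a][b] * row[b] for a in range(2) for b in range(2))
        delta = min(4.0, h / (p / n))  # HC4 exponent
        w = (yi - b0 - b1 * xi) ** 2 / (1.0 - h) ** delta
        for a in range(2):
            for b in range(2):
                meat[a][b] += w * row[a] * row[b]

    # Sandwich (X'X)^-1 * meat * (X'X)^-1, slope element only
    var_b1 = sum(
        inv[1][a] * meat[a][b] * inv[b][1] for a in range(2) for b in range(2)
    )
    return b1, var_b1 ** 0.5


# Hypothetical orchard means: B = without treatment, C = with treatment
group_b = [3.1, 3.4, 2.9, 3.6, 3.3, 3.0, 3.5, 3.2, 2.8, 3.7, 3.1, 3.4]
group_c = [2.6, 3.9, 2.2, 3.5, 2.9, 4.1, 2.4, 3.0, 3.8, 2.1, 3.3, 2.7]

effect, robust_se = hc4_two_group(group_b, group_c)
print(f"C - B = {effect:.3f}, HC4 SE = {robust_se:.3f}")
```

With two equally sized groups the HC4 standard error coincides with the one used in Welch's unequal-variance t-test, which is a handy sanity check.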

Based on your description, you'd want to use your treatments (B vs. C) as the independent variable and alpha diversity as the dependent variable. You shouldn't use the Kruskal-Wallis test because your samples are not independent of each other, i.e., your samples are clustered within orchards.

Thank you!
I will try the "heteroskedasticity-robust F test or use robust standard errors (e.g., HC4) in R".
If I only use the mean of each orchard for statistical testing, could I then use Kruskal-Wallis? Or would you not recommend working with the means? (I have quite high variance in alpha diversity within orchards, and for some orchards the nutrient contents also vary considerably.)

Yes, that's one way to do it. In that case, you have independent observations.

I'd prefer mixed effects models that use all the data you collected. By aggregating data within orchards, you throw away useful information like variance among orchards with the same treatment.
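The aggregation route can be sketched in plain Python as follows. This is a minimal illustration on simulated, hypothetical data, not an analysis of the real study. Each orchard is collapsed to one mean, and the Kruskal-Wallis H statistic is computed on the 24 means. The p-value uses a chi-square approximation with one degree of freedom, which applies only because there are exactly two groups, and no tie correction is applied. The names `samples`, `orchard_means`, and `kruskal_wallis_h` are assumptions.

```python
import math
import random
import statistics

random.seed(42)

# Hypothetical data: 24 orchards, 5 Shannon values each.
# "B" = without treatment, "C" = with treatment.
samples = {}
for orchard_id in range(24):
    treatment = "B" if orchard_id < 12 else "C"
    orchard_level = random.gauss(3.2 if treatment == "B" else 2.9, 0.3)
    samples[orchard_id] = (
        treatment,
        [random.gauss(orchard_level, 0.4) for _ in range(5)],
    )

# Step 1: collapse each orchard to a single mean (independent observations)
orchard_means = {"B": [], "C": []}
for treatment, values in samples.values():
    orchard_means[treatment].append(statistics.fmean(values))


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H with average ranks; no tie correction."""
    pooled = sorted(v for g in groups for v in g)
    positions = {}
    for idx, v in enumerate(pooled, start=1):
        positions.setdefault(v, []).append(idx)
    rank_of = {v: statistics.fmean(idx) for v, idx in positions.items()}
    n_total = len(pooled)
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n_total * (n_total + 1)) * rank_term - 3.0 * (n_total + 1)


# Step 2: test the orchard means
h_stat = kruskal_wallis_h([orchard_means["B"], orchard_means["C"]])
# Chi-square survival function for 1 degree of freedom (two groups only)
p_value = math.erfc(math.sqrt(max(h_stat, 0.0) / 2.0))
print(f"H = {h_stat:.3f}, approximate p = {p_value:.4f}")
```

Note what is lost in step 1: the five values per orchard, and therefore the within-orchard variance, never enter the test.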

Alright, sounds plausible!