How to compute alpha diversity and beta diversity without rarefaction

Hi all,

I have over 298 samples with highly variable sequencing depths; the rarefaction curves are shown below.

I am struggling to choose an optimal rarefaction depth. Even at 10,000 reads (first red line), 169 samples would be removed. However, if I select a shallower depth, diversity would be underestimated. Is there an alternative way to normalize the data, other than rarefying, so that I can keep all or at least most samples when computing alpha and beta diversity?

Hello Wei_Zhang,

You are correct, there is a tradeoff between keeping more samples and keeping more depth, which has been discussed here and here. I don't think there's a perfect solution.

I have not tried this, but you could try the SRS tool!


Hi @colinbrislawn ,

Thank you for the suggestion. There is indeed no perfect solution for this issue. However, I checked the alpha diversity, especially the Shannon index, and found that Shannon does not change after rarefaction (figure below). This suggests that Shannon diversity has reached a plateau, so I can compare samples after rarefaction. However, I have no idea how to deal with beta diversity. Does rarefying affect it much?

That's correct, and that is also expected for Shannon's entropy.
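For anyone who wants to see why this is expected, here is a minimal sketch in Python with NumPy. The count vector is made up for illustration (a few dominant taxa plus a long tail of rare ones) and is not from the dataset in this thread. Shannon entropy is driven mostly by the abundant taxa, so subsampling barely moves it, while observed richness drops because rare taxa are lost.

```python
import numpy as np

def shannon(counts):
    """Shannon entropy (natural log) of a count vector."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from a count vector."""
    return rng.multivariate_hypergeometric(np.asarray(counts), depth)

rng = np.random.default_rng(0)

# Hypothetical sample: 5 dominant taxa and 130 rare taxa (40,000 reads total).
counts = np.array([20000, 10000, 5000, 2500, 1200] + [10] * 130)

full_h = shannon(counts)
full_richness = int((counts > 0).sum())

# Rarefy to 2,000 reads and recompute both metrics.
sub = rarefy(counts, 2000, rng)
sub_h = shannon(sub)
sub_richness = int((sub > 0).sum())

print(f"Shannon:  full={full_h:.3f}  rarefied={sub_h:.3f}")
print(f"Richness: full={full_richness}  rarefied={sub_richness}")
```

This is only a toy demonstration under those assumptions; checking the real data both ways, as done above, is still the better test.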

I really like that you tried it both ways and tested the result. I'm a big fan of this method! :+1:

Yes, and it's been heavily debated.

Why subsampling is (always!) bad: Waste not, want not: why rarefying microbiome data is inadmissible - PubMed
Why subsampling is (often!) fine: Normalization and microbial differential abundance strategies depend upon data characteristics - PMC


Hi @colinbrislawn ,

Thank you for the reply. I read these papers carefully, and here is my brief summary.

  1. Regarding alpha diversity, if the diversity does not change with library size, we can compare samples after rarefaction.
  2. Regarding beta diversity, it seems that using proportions always outperforms other methods, so proportions are recommended. On the other hand, if I'd like to investigate the influence of other factors on the microbial community, rather than comparing sample-wise distances, do proportions still work well?
  3. Regarding differential taxa, there are numerous ways to compare them (DOI: 10.1038/s41467-022-28034-z). There is no single optimal option for all datasets.
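To make point 2 concrete, here is a minimal sketch of what converting to proportions does before computing Bray-Curtis. The table and the helper names (`to_proportions`, `bray_curtis`) are hypothetical and written in plain NumPy, not taken from QIIME 2. It only shows the mechanics and does not settle which normalization is best for a given dataset.

```python
import numpy as np

def to_proportions(table):
    """Divide each sample (row) by its total read count."""
    table = np.asarray(table, dtype=float)
    return table / table.sum(axis=1, keepdims=True)

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.abs(u - v).sum() / (u + v).sum())

# Hypothetical feature table: rows are samples, columns are taxa.
# Samples 0 and 1 have the same composition but a 10x difference in depth.
table = np.array([
    [100,   50,  50],
    [1000, 500, 500],
    [10,    10, 180],
])

props = to_proportions(table)

d_raw = bray_curtis(table[0], table[1])    # inflated by the depth difference
d_same = bray_curtis(props[0], props[1])   # depth effect removed
d_diff = bray_curtis(props[0], props[2])   # genuinely different composition

print(d_raw, d_same, d_diff)
```

On raw counts, the two samples with identical composition look very different purely because of depth. After converting to proportions that difference disappears, while the sample with a different composition stays distant.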

Because I am not a statistician, I will not offer too much advice.

There is one piece of information, which is illuminating if you have not found it already.

This is because