Alpha-diversity after filtering

Hello everybody,
I am just starting to study the microbiome for my master’s degree, and I was wondering whether alpha-diversity indices are still relevant after filtering, since some of them estimate unobserved species from the species that are observed only once (or twice, with jackknife 2).
My team is using c = 0.005% as recommended by Bokulich et al. (2013), but while this gets rid of sequencing errors and polymerase errors, I am afraid we also lose useful information.

Thanks a lot for your answers,

Florian Touitou
Master’s Degree in Animal Nutrition
Vet School Student in Toulouse, France.

This is particular to a handful of alpha diversity metrics like Chao1, which rely on singleton counts. Filtering singletons would invalidate those metrics, but all others (like richness, PD, Shannon, evenness, etc.) would be unaffected, in the sense that they do not use singletons to estimate unobserved species. I believe the Bokulich 2013 paper briefly mentioned those caveats.
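To make the dependence on singletons concrete, here is a minimal sketch in plain Python. The read counts are made up for illustration, and the Chao1 function uses the bias-corrected form S_obs + F1(F1 - 1) / (2(F2 + 1)); your pipeline may use a different variant.

```python
import math


def observed_richness(counts):
    """Number of features (OTUs/ASVs) with at least one read."""
    return sum(1 for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    return observed_richness(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))


def shannon(counts):
    """Shannon entropy H = -sum(p * ln p), using the natural log."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical read counts for one sample (invented for this example).
raw = [500, 300, 120, 40, 20, 10, 2, 2, 1, 1, 1, 1]
no_singletons = [c for c in raw if c > 1]

for label, counts in (("raw", raw), ("singletons removed", no_singletons)):
    print(label, observed_richness(counts), chao1(counts), round(shannon(counts), 3))
```

With these counts, Chao1 drops from 14 to 8 once singletons are removed, i.e. it collapses to the observed richness because F1 is zero, while Shannon changes only slightly.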

I recommend just sticking with other alpha diversity metrics that are not sensitive to this issue.

Note that those recommendations were made for OTU clustering. If you are using dada2 or deblur for denoising, you do not need to set an abundance filter ("the Bokulich Method"), though others on this forum have reported continuing to use that method. Either way, denoising methods do set their own filters (both remove singletons by default), so unless you disable those filtering parameters, those methods also invalidate the use of alpha metrics like Chao1!
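For anyone applying the abundance filter in a homemade pipeline, a rough sketch of the idea follows. It assumes the threshold is applied to each OTU's total count as a fraction of all reads in the table (0.005% = 0.00005); the function name, the OTU names, and the counts are all hypothetical.

```python
def abundance_filter(otu_totals, min_fraction=0.00005):
    """Keep OTUs whose total read count is at least min_fraction of all reads.

    otu_totals maps an OTU identifier to its read count summed over all samples.
    The default min_fraction of 0.00005 corresponds to c = 0.005%.
    """
    total_reads = sum(otu_totals.values())
    threshold = min_fraction * total_reads
    return {otu: n for otu, n in otu_totals.items() if n >= threshold}


# Hypothetical table totals: 1,000,000 reads, so the cutoff is about 50 reads.
otu_totals = {
    "OTU_a": 600000,
    "OTU_b": 399900,
    "OTU_c": 60,
    "OTU_d": 39,
    "OTU_e": 1,
}

kept = abundance_filter(otu_totals)
print(sorted(kept))
```

Here OTU_d and OTU_e fall below the cutoff and are dropped, which shows that this filter removes more than just singletons on a deep sequencing run.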

Good luck!

We’re actually using a homemade pipeline…
I believed that all richness estimators such as Chao1, ACE, or jackknife use singletons at one point or another. For Chao1 it is obvious, but ACE uses the estimated coverage, which is calculated from singletons, and jackknife relies on an estimate of unobserved species that is also calculated from singletons. (I may be wrong; that is what I understood from Eric Marcon, Mesures de la Biodiversité, which is in French unfortunately.)

I also have trouble using Shannon’s index, since -ln(p_s) increases a lot for particularly rare species and is not fully compensated by the p_s factor. Filtering real species (along with real errors) will decrease Shannon and introduce a bias between samples.
Simpson’s index is less sensitive to those rare species, from what I understood.
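A quick numeric check of the per-species terms may help here. This is only a sketch: it compares the contribution of one species with relative abundance p to Shannon (-p ln p) and to the Simpson sum (p squared), for a few arbitrary values of p.

```python
import math


def shannon_term(p):
    """Contribution of one species with relative abundance p to Shannon's H."""
    return -p * math.log(p)


def simpson_term(p):
    """Contribution of one species to the Simpson sum of squared proportions."""
    return p * p


# Arbitrary relative abundances: dominant, uncommon, and very rare.
for p in (0.5, 0.01, 0.0001):
    print(
        f"p={p:g}  -ln(p)={-math.log(p):.2f}  "
        f"Shannon term={shannon_term(p):.6f}  Simpson term={simpson_term(p):.2e}"
    )
```

The -ln(p) factor does grow as p shrinks, but the product -p ln p still goes to zero, so each rare species adds only a little to H. It shrinks much more slowly than p squared, though, which is why Shannon is more sensitive to rare species than Simpson.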

I hope I did not get lost with all the digging into formulas :confused:

Thank you very much for your quick reply!

Yes, Chao1, ACE, and jackknife all do. But species richness, a.k.a. observed species (or, more often in microbiome results, observed OTUs/ASVs), does not. So you do have options.

Yes, filtering impacts Shannon's H, but the question is: are those rare OTUs real? If not (e.g., because they are noisy OTUs), then you do not want to account for them in calculating H, and the change corrects the index. But if the majority of rare OTUs are real, then filtering deflates the true value, though this effect is probably applied evenly across groups and so does not impact relative comparisons of H too strongly. So I would argue that Shannon should still be used even with this type of filtering.

Not at all! It is good to grapple with these questions; that is better than, say, blindly using Chao1 on the assumption that it is appropriate here. You always need to be aware of a metric's assumptions, and since microbiome data has some constraints (e.g., abundance filtering thresholds) you just need to use the metrics that are appropriate...

Thank you very much for your very helpful answer. I will probably use and interpret Shannon’s entropy and the Simpson index, being really careful with the conclusions and explaining the actual impact of filtering on the formulas.

It’s been really nice!
All the best for the future.