Normalization for microbiome 16S sequence analysis

abalter · June 24, 2019, 8:43pm

The way I understand things, normalization (such as in DESeq2, edgeR, etc.) serves two purposes: 1) to model the "real" abundance in the original samples from the read counts, and 2) to make the abundance distributions conform to the needs of statistical analysis by removing heteroskedasticity, dependence, dispersion, etc.

It has been stated many times here that it is very difficult to reproduce the fold change you get from DESeq2 by extracting the normalized counts, but you can come close. Taken at face value, "fold change" sounds like it should refer to the ratio of the "real" abundances: the fold change of "actual" expression or "actual" community representation.

So if the normalized (or normalized + VST, or normalized + MLE) abundance better represents the "real" abundance, then shouldn't I use the normalized counts for ALL of my analysis steps:

Alpha diversity
Beta diversity
F2B ratio
IgA sorting analysis
Other regression analysis
etc.

NOTE: Cross-posted from here and [here](https://bioinformatics.stackexchange.com/questions/8846/normalization-for-microbiome-16s-sequence-analysis) due to no response.

mortonjt · June 27, 2019, 1:35pm

Hi @abalter, I think you are hitting a key point. It is difficult to reproduce fold change mainly because it isn't possible.

It's not possible to infer absolute differences from relative data.
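A minimal sketch of why this is so, using made-up absolute counts for three hypothetical taxa (the numbers and the `to_proportions` helper are illustrative assumptions, not taken from the paper). Two very different absolute changes produce exactly the same relative abundances, so the proportions alone cannot tell them apart:

```python
def to_proportions(counts):
    """Convert absolute counts to relative abundances that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical absolute cell counts for taxa A, B, C at baseline.
baseline = [100, 100, 100]

# Scenario 1: taxon A doubles; B and C are unchanged.
after_growth = [200, 100, 100]

# Scenario 2: taxon A is unchanged; B and C are halved.
after_decline = [100, 50, 50]

p_growth = to_proportions(after_growth)
p_decline = to_proportions(after_decline)

# Both scenarios give the same composition.
same_composition = all(abs(a - b) < 1e-12 for a, b in zip(p_growth, p_decline))
print(p_growth, p_decline, same_composition)

# What does survive is the ratio between two taxa: A relative to B
# doubled in both scenarios, whatever happened to the total load.
ratio_a_to_b = p_growth[0] / p_growth[1]
print(ratio_a_to_b)
```

Sequencing only observes something like `p_growth` or `p_decline`, which is why a fold change in "actual" abundance is not identifiable without extra information, while a ratio against a chosen reference taxon is.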

See our paper that just came out here: Establishing microbial composition measurement standards with reference frames | Nature Communications

These concepts apply to DESeq2, since DESeq2 assumes that the median for each sample is constant. This may not be a good choice of reference, and alternatives are discussed.
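For readers who have not looked at how this works, here is a simplified sketch of the median-of-ratios idea behind DESeq2's size factors. It is a pure-Python illustration under stated assumptions (a toy table, taxa containing any zero skipped), not DESeq2's actual code, and the function name `size_factors` is made up for this example:

```python
import math
from statistics import median

def size_factors(table):
    """Median-of-ratios size factors.

    table: list of samples, each a list of counts per taxon (same order).
    Taxa with a zero in any sample are skipped, since their geometric
    mean would be zero.
    """
    n_taxa = len(table[0])
    # Log of the geometric mean of each taxon across samples.
    log_geo_means = []
    for j in range(n_taxa):
        column = [sample[j] for sample in table]
        if all(c > 0 for c in column):
            log_geo_means.append(sum(math.log(c) for c in column) / len(column))
        else:
            log_geo_means.append(None)

    factors = []
    for sample in table:
        # Log ratio of each taxon to its cross-sample geometric mean.
        log_ratios = [
            math.log(sample[j]) - log_geo_means[j]
            for j in range(n_taxa)
            if log_geo_means[j] is not None
        ]
        # The median ratio is taken as the sample's scaling factor.
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy table: sample 2 is sequenced twice as deeply, and its last taxon
# has additionally bloomed.
table = [
    [10, 20, 30, 40],
    [20, 40, 60, 800],
]
factors = size_factors(table)
print(factors)
```

In this toy case the median ignores the single blooming taxon and the second sample's factor comes out twice the first's. The implicit reference is the "typical" taxon, which is assumed not to change between samples. If most taxa shift together, that reference moves with them.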

abalter · June 27, 2019, 4:35pm

@mortonjt Thank you so much for replying. As I said, I've gotten no response to this question. And congratulations for trying to tackle the issue of compositional data head-on. I look forward to digesting the article and trying out your ratio methods.

In the short term, would you say that doing the standard routine (alpha, beta, igaseq, regression, etc.) on some sort of normalized data rather than raw data would be an improvement?

Nicholas_Bokulich · June 27, 2019, 4:41pm

Definitely some sort of normalization is required for most analyses, but the same sort of normalization may or may not be appropriate for all methods.

In QIIME 2 we handle this by having each method (mostly) perform the normalization that is required. So a couple examples:

alpha/beta diversity methods have their own normalization (rarefaction in the core-metrics pipelines; see q2-breakaway for a more sophisticated method for attempting to estimate the true alpha diversity if rarefaction is upsetting)
differential abundance methods have their own normalization procedures on-board (e.g., see ANCOM or @mortonjt's methods)
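As a rough illustration of what rarefaction does, here is a sketch that subsamples one sample's reads without replacement down to a fixed depth. The `rarefy` function and the counts are hypothetical and written for this example. This is not QIIME 2's implementation.

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample a count vector to `depth` reads without replacement."""
    if depth > sum(counts):
        raise ValueError("sample has fewer reads than the requested depth")
    rng = random.Random(seed)
    # Expand the counts into one taxon label per read.
    reads = [taxon for taxon, n in enumerate(counts) for _ in range(n)]
    picked = rng.sample(reads, depth)
    # Tally the subsampled reads back into a count vector.
    rarefied = [0] * len(counts)
    for taxon in picked:
        rarefied[taxon] += 1
    return rarefied

# Hypothetical sample with 1000 reads across four taxa.
sample = [50, 30, 20, 900]
rarefied = rarefy(sample, depth=100)
print(rarefied, sum(rarefied))
```

Every sample ends up with the same total, which makes diversity metrics comparable across samples, at the cost of discarding the reads above the chosen depth (and any sample below it).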

It would be awesome to see other normalization methods implemented in QIIME 2 and we have some open issues — if you are interested in getting involved please let us know!

abalter · June 27, 2019, 6:04pm

@Nicholas_Bokulich thanks! Regarding the rarefaction, does the "waste not, want not" idea not apply to that metric?

I tried googling for QIIME2 normalization and found the normalize_table.py and differential_abundance.py scripts. Both have options to use DESeq normalization. There are also some third-party scripts implementing percentile normalization:

https://github.com/cduvallet/q2-perc-norm

https://github.com/seangibbons/percentile_normalization

I couldn't find one that does ANCOM.

Is ANCOM normalization implemented in QIIME2?

I'm not fully versed, but it seems to me that the more biological/clinical microbiome literature is well behind the improved methods being developed.

I did a small amount of Google sleuthing just now and found that there are quite a few papers developing percentile methods or other ways to better normalize compositional data.

This is a hot area I hope to explore!

Nicholas_Bokulich · June 27, 2019, 6:12pm

Yes and no — there has been a lot of discussion on this forum about rarefaction and the arguments for/against, see this post for a nice collection of those topics:

Those scripts (normalize_table.py and differential_abundance.py) are implemented in QIIME 1, not QIIME 2.

Yes! q2-perc-norm is a QIIME 2 plugin! You can see the QIIME 2 plugin library to see these and other plugins (some of which use alternative normalization techniques, including @mortonjt's methods)

Microbiome data normalization and differential abundance methods are very active areas of research, and of debate.

It is hot! Best of luck, and please consider adding any new methods you use or implement to QIIME 2!

abalter · June 27, 2019, 6:39pm

Thanks again for the many answers! Just one more question: is there an ANCOM plugin for QIIME2?

Nicholas_Bokulich · June 27, 2019, 6:55pm

Oh sorry, I missed that in the cross-fire. Yes:

https://docs.qiime2.org/2019.4/plugins/available/composition/ancom/

See the tutorials on the QIIME 2 website for some examples of using ANCOM in QIIME 2 workflows.

Good luck!