Feature Table Normalization Strategies in QIIME2

Hello!

I had a question about methods for normalizing a feature table in QIIME2.

I’m working with environmental samples that come from similar but varying climates and substrates, and I’m trying to explore shifts in bacterial community structure in relation to these environmental differences. Because my samples come from a variety of environmental sources, the number of sequences per sample varies quite a bit as well (which may also reflect differences in community diversity between samples). As far as I can tell, the primary strategy for normalizing feature tables for downstream analyses in QIIME2 is rarefying. However, because sequence counts sometimes differ considerably between sampling environments, I’m not sure that rarefying my feature table makes the most sense, since I would be discarding a lot of sequences from several samples that could contain useful data. Am I misguided for thinking this, or are there better normalization strategies that I could work with?

I had previously been working with QIIME 1, where I would normalize my OTU tables by 16S copy number. However, after reading more about it, that method seems to be discouraged because of copy number inaccuracies. Alternatively, QIIME 1 had normalize_table.py, which could normalize an OTU table through DESeq2 or Cumulative Sum Scaling (CSS). It doesn’t look like either of these methods is currently implemented in QIIME 2, but I was considering running normalize_table.py for CSS in QIIME 1 on my exported QIIME 2 feature table (.biom), and importing the normalized table back into QIIME 2 for further analyses. Would this make sense, or is there a better method of doing this?

Are there other feature table normalization methods (implemented through QIIME 2 or otherwise) that anyone might recommend? I’ve looked through some other forum posts about this topic, but it looks like they didn’t come to strong conclusions.

Any and all help is greatly appreciated! Thank you in advance!

  • Nate

Hi Nate,
Adding alternative normalization strategies has been on our radar for a while. We have an open issue here to track progress. We will post here with updates when those are added to a future QIIME2 release.

However, non-rarefying-based normalization is built into some QIIME2 analyses, e.g., ANCOM for differential abundance testing. So separate normalization may not be necessary (and may even be undesirable), depending on your goals.

For now, rarefying is the only normalization method built into QIIME2 for alpha and beta diversity analyses, though others are planned for the near future.
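To make the trade-off concrete, here is a minimal sketch of what rarefying does to a feature table: each sample is randomly subsampled without replacement to a common depth, and samples with fewer reads than that depth are dropped. This is an illustration in plain NumPy, not the QIIME2 implementation; the function names, the toy table, and the depth of 100 are all made up for the example.

```python
import numpy as np


def rarefy_sample(counts, depth, rng):
    """Subsample one sample's feature counts to `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads, which
    corresponds to the sample being dropped from the table.
    """
    counts = np.asarray(counts, dtype=np.int64)
    if counts.sum() < depth:
        return None
    # Expand counts into one entry per read, labelled by feature index.
    reads = np.repeat(np.arange(counts.size), counts)
    kept = rng.choice(reads, size=depth, replace=False)
    # Collapse the kept reads back into per-feature counts.
    return np.bincount(kept, minlength=counts.size)


def rarefy_table(table, depth, seed=42):
    """Rarefy every sample in a {sample_id: counts} mapping (hypothetical layout)."""
    rng = np.random.default_rng(seed)
    rarefied = {}
    for sample_id, counts in table.items():
        subsampled = rarefy_sample(counts, depth, rng)
        if subsampled is not None:
            rarefied[sample_id] = subsampled
    return rarefied


# Toy data: three samples, four features, very uneven depths.
toy_table = {
    "soil_A": [500, 300, 200, 0],
    "soil_B": [50, 30, 15, 5],
    "rock_C": [10, 0, 5, 0],
}
rarefied = rarefy_table(toy_table, depth=100)
```

In this toy case `rock_C` is dropped entirely and `soil_A` keeps only 100 of its 1000 reads, which is exactly the loss of data the original question is worried about.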

Alas, there is not currently a better way.

I hope that helps!


Hi Nicholas,

Thank you for the clarification and for the quick response! I’ll test CSS and check some other normalization methods for my data.

Thanks,
Nate


Hi Nate,

I’ve done a good deal of thinking about these topics.

First, you say that the number of sequences varies a good bit between samples. This generally does not reflect true biological information, but rather library prep/sequencing issues. If you follow a standard protocol, you should be putting in the same amount of DNA for each sample, and you should be getting roughly the same number of sequences. The community composition of those sequences is the true biological information that may vary, not the sequence counts. Of course it’s not perfect, and variation is going to happen because of library prep/sequencing issues. I would use extreme caution in thinking that sequence counts are a reflection of true biological information.

Based on the Weiss et al. 2017 paper “Normalization and microbial differential abundance strategies depend upon data characteristics,” a rough guideline to follow is that if the sequencing depth varies by more than 10x, you should probably go with rarefying. I generally use CSS and have sequencing depth that varies around 10x in most of my raw libraries. When I do CSS, that usually brings the variation in libraries to something more like 3x. The Weiss paper does not directly address this, but I suppose that if CSS can get your variation in sequencing depth to below 10x, then you might be ok. I am working with a group compiling several datasets, and in that case we needed to rarefy because of the large variation in depth across datasets.
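A quick way to check where a table sits relative to that rough 10x guideline is to compare the deepest and shallowest samples. The sketch below is only an illustration: the helper names, the `{sample_id: counts}` layout, and the default threshold of 10 are assumptions that encode the rule of thumb described above, not anything taken from QIIME2 or from the paper’s code.

```python
import numpy as np


def depth_fold_range(table):
    """Return max depth / min depth across samples in a {sample_id: counts} mapping."""
    depths = np.array([np.sum(counts) for counts in table.values()], dtype=float)
    if depths.size == 0 or depths.min() <= 0:
        raise ValueError("every sample needs at least one read")
    return float(depths.max() / depths.min())


def exceeds_guideline(table, threshold=10.0):
    """True when depth varies by more than `threshold`-fold (rule of thumb only)."""
    return depth_fold_range(table) > threshold


# Toy data: depths are 1000, 100 and 15 reads.
toy_table = {
    "soil_A": [500, 300, 200, 0],
    "soil_B": [50, 30, 15, 5],
    "rock_C": [10, 0, 5, 0],
}
fold = depth_fold_range(toy_table)
leans_toward_rarefying = exceeds_guideline(toy_table)
```

Running the same check on the table before and after a normalization such as CSS gives a simple way to see whether the spread in depth has actually narrowed.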

I looked a bit into DESeq, but ended up going with metagenomeSeq and CSS because it seemed a little bit more straightforward. I usually do the CSS normalization in R, and only recently learned it had been implemented in QIIME 1. I can share code on CSS normalization in R if anyone has interest. The issue there is that something is going on with the compatibility of the .biom files between metagenomeSeq and phyloseq, and then getting those biom files back into QIIME doesn’t work.
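For anyone who wants to see the idea behind CSS outside of R, here is a simplified sketch in Python. Each sample’s counts are divided by the sum of its nonzero counts that fall at or below a chosen quantile, then multiplied by a constant. Note the assumptions: metagenomeSeq chooses the quantile from the data, whereas this sketch uses a fixed `p=0.5`, and the function name and the `scale=1000` constant are picked for illustration. Treat it as a way to understand the method, not as a replacement for metagenomeSeq.

```python
import numpy as np


def css_normalize(counts, p=0.5, scale=1000.0):
    """Simplified cumulative sum scaling for one sample (fixed quantile `p`)."""
    counts = np.asarray(counts, dtype=float)
    nonzero = counts[counts > 0]
    if nonzero.size == 0:
        # Nothing to scale in an empty sample.
        return counts.copy()
    # Quantile of the nonzero counts only, so zeros do not drag it down.
    cutoff = np.quantile(nonzero, p)
    # Scaling factor: sum of the counts at or below the cutoff.
    factor = nonzero[nonzero <= cutoff].sum()
    return counts / factor * scale


# Two toy samples with the same composition but a 10x difference in depth.
shallow = [0, 1, 2, 3, 94]
deep = [0, 10, 20, 30, 940]
shallow_css = css_normalize(shallow)
deep_css = css_normalize(deep)
```

In this toy case the two samples come out identical after scaling, because the factor is driven by the lower-abundance features rather than by the single dominant one.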


This would make a great community tutorial!

Thanks for the input @CarlyRae! Your experience with these different techniques will be a useful guide when we get around to implementing other normalization techniques in QIIME2. (And if you have any interest in contributing a plugin to do these normalization steps, get in touch in the developers category and we would be very happy to help.)


Hi CarlyRae,

Thank you for mentioning that paper! I had not read it yet, and it will definitely be a useful guide for my data. I’ll have to test a few of these different normalization methods, but as you mentioned, rarefying may end up being the better option.

And yes, thank you for clarifying the relation (or lack thereof) between sequence counts and true biological information. I was certainly over-generalizing from an observation within my own data, and I won’t be relying on it for actual analyses and conclusions.

  • Nate
