Batch Correction - Dealing With Negative Values for Diversity Analysis


I am analyzing a dataset containing samples from two separate sequencing runs. I processed the reads identically, including trimming the same number of bases off the ends during the DADA2 step, before merging the feature table and representative sequence files. PERMANOVA on the wUnifrac distance matrix showed that the batch effect is significant (p = 0.001) and the test statistic for batch is greater than the test statistic for the experimental groups of interest.

I used the ComBat function of the ‘sva’ R package to batch correct my feature table, then imported it back into R. I ran the feature-table summarize function on this table without any problems. However, when I tried to run core-metrics-phylogenetic I got the following error:

Plugin error from diversity: count < 0

The ComBat normalization introduced negative values into the feature table, and I assume this is causing the issue. However, when I analyzed this data previously in QIIME1 I was able to run diversity analyses with batch corrected data containing negative values.

My questions: Is there currently no method for dealing with negative values in QIIME2? Can negative values in normalized data be thought of as ‘effectively 0’, or does it make a big difference if one feature has a count of -5 and another feature has a count of -10? Are there others strategies I should look into that may avoid this problem?

Hi @Zachary_Bendiks,
Sounds like you are in a pickle! :cucumber:

First off — I applaud your due diligence in appropriately structuring your runs and testing for a batch effect proactively.

No. If qiime1 could handle negative values, it sounds like a bug. What does an abundance of -5 really even mean!? (note that ComBat is designed for microarray data as far as I know, not for microbiome data where negative abundance is troubling) Most downstream methods in QIIME2 are assuming that the data in a FeatureTable[Frequency] artifact are counts, not some type of normalized data, and so negative values will break some of those methods’ assumptions.

I do not know enough about ComBat and whether it performs some type of count normalization in addition to batch normalization — is the output still considered “count” or is it something like mean-centered value? So the difference between -5 and -10 could actually be meaningful; converting all negative values to 0 might not be kosher, nor could adding N to all values where N = the min value.

@cduvallet might have some advice on working with ComBat data.

Depending on your experimental setup, there might be.

The brand new q2-perc-norm plugin is designed for normalization of case-control studies (and was compared to ComBat in the original paper).

Your batch effects may also be from contaminants that are unique to each run. If you have negative controls, you could try decontam (not yet a QIIME2 plugin but we hope to add soonish). See this post for some more details:

Let us know what you think! And let’s see if @cduvallet and @benjjneb have any other advice.

Sounds like you are in a pickle! :cucumber:

Yeah haha, though thankfully the batch effect is much less pronounced when I process the data in QIIME 1 (consequence of OTU picking v. ASV picking? I’m not sure). If worst comes to worst I’ll just use my old QIIME1 workflow.

No. If qiime1 could handle negative values, it sounds like a bug. What does an abundance of -5 really even mean!?

That’s definitely been a point of discussion in our lab. I remember that normalizing QIIME1 data with DESeq2 (not CSS) would sometimes give negative counts as well, but I don’t recall any of the downstream scripts having issues with this.

Reading into it more, it seems like ComBat is scaling the counts based on some “gold standard” microarray data, which does not sound good for 16S amplicon data.

Depending on your experimental setup, there might be.

It’s a study comparing a control and an experimental group, so the pern-norm plugin might be what I need. I will look into those packages and see what they can do for my data. Thanks a lot for the help!

I agree this is likely a consequence of OTU picking, though that effect should be somewhat less pronounced with wUniFrac (I’d expect strong differences unweighted UniFrac but not necessarily weighted).

That is actually an interesting argument for OTU picking vs. denoising, and is one of the practical purposes of OTU picking — reducing noise by collapsing similar sequences into OTUs.

Still, I would be wary. If the problem goes away with OTU clustering, that is not necessarily a good thing, and I wonder what else might be lost in the process. I would focus on looking at the ASVs that are unique to each run and consider whether these may be contaminants that could be removed by decontam or by brute force (if they are very clearly contaminants or non-target DNA). OTU clustering is a bit of a “sweep it under the rug” solution, but I suppose it is practical in this case.

QIIME1 was much more flexible, for better or worse. QIIME2 is strict about defining semantic types and validating these types to make sure users don’t stick their data where it doesn’t belong, thereby breaking the assumptions of certain methods (e.g., using non-normalized data where normalized is expected, or vice versa). So sometimes in QIIME1 “it worked…” might not mean “…as intended”.


Awesome! I hope that fits your needs.

The output of ComBat is certainly not something that can be considered a “count” (if only because it’s continuous values, not discrete!) From what I understand, the output of ComBat doesn’t have particularly meaningful units.

My intuition tells me that the difference between -5 and -10 is certainly meaningful (if it’s on the same order of magnitude as other differences); converting all negative values to 0 is certainly not kosher; but adding N to all values might be. I would confer with a statistician to confirm!

Really? Would you mind sharing the source of that? I’m surprised because from looking at the paper it looks like ComBat just estimates (and then finds some factors to minimize) a few parameters, based on the input data.

Regardless of what exactly ComBat is doing, the output is almost certainly not valid for the majority of ecology-based alpha diversity metrics, but there may be some more field-agnostic ones (e.g. rooted in information theory or something) that could work?

Right - if you randomized your samples across batches and have enough “controls” in the two batches, you could percentile-normalize your data relative to these “controls” and then combine that data together. Two notes on this front:

  1. “control” samples don’t actually have to be healthy - they just need to be comparable (i.e. of the same “group”) in both batches. But it looks like you have case-control data so it should work as described in the paper!
  2. the output of percentile-normalization is also not something that can be thought of as counts. Calculating alpha diversity on these numbers is not appropriate. Depending on which distance you’re using, beta-diversity should be ok (as long as the distance metric is agnostic to units, AKA not an ecological metric)

The other option is to just do your analyses on each batch separately, and combine them when you can. For example, I imagine alpha diversity shouldn’t be too affected by batch (though you’d want to check, of course). And beta diversity calculations are fine to combine, as long as you only calculate beta diversity between samples within the same batch.


Oh and one more note that if there are functions associated with percentile normalization that would make the plugin better/easier to use, I’d be happy to take suggestions and see if I can write them up! The plugin currently only has the most basic functionality, but if I can reasonably add things to make it more useful to more people then everybody wins!

You can make suggestions by posting on this forum or raising issues on the github repo.


Really? Would you mind sharing the source of that? I’m surprised because from looking at the paper it looks like ComBat just estimates (and then finds some factors to minimize) a few parameters, based on the input data.

You’re right. I was looking at the mComBat paper, not the original one. Thanks for pointing that out.



Thanks again for your suggestions! Based on the responses, I corrected my data using the perc-norm plugin. However, I am still getting very strong batch effects (in fact, the wUniFrac distance between the batches is greater than it was before the correction).

I am not sure if I set up the ‘control’ and ‘case’ samples correctly. I called all samples from one experimental group the ‘controls’ and all samples from the other group ‘cases’. In this way, I essentially ignored which batch each sample came from. Is this correct? Do I need to indicate which sequencing batches the samples came from?

Attached is the mapping file that I’ve set up for the study. Ultimately, I want to reduce the batch effect as much as possible so I can identify which taxa differ between the experimental groups at each time point. Any advice you can offer is greatly appreciated!
LSU_mapfile_Z_and_E_batch_no_controls_for_perc_norm.txt (24.5 KB)

You’re almost there! I’m assuming you called percentile-normalize directly on this metadata table, right? Unfortunately, you need to run percentile normalization on each batch separately - so you need to call it twice, once on a column where cases and controls for batch1 are labeled and another time on a column where cases and controls for batch2 are labeled.

I’ve been considering making some sort of wrapper function that can (1) percentile-normalize each batch separately from one command and then (2) merge the resulting percentile-normalized tables into one table. Unfortunately that code isn’t written yet so you’ll have to export your percentile-normalized OTU tables to python (or whatever coding language you use), merge them in that, and then re-import them into qiime for downstream analyses. Not ideal, I know - but this is very helpful to better understand how users would need these functionalities!

Regardless, it looks like you only have 6-12 control samples in your second batch, which isn’t really enough for the method to work. We recommend having at least something like 30 samples in the control group.

Also, technically, I don’t think you should be lumping your pre and post samples into the same category, because an underlying assumption of the method is that control samples are from a homogenous population. @seangibbons - do you agree/additional thoughts? Looks like @Zachary_Bendiks has measured samples pre- and post- perturbation, with one group getting a “control” treatment (A(Amylose)) and one group getting a “real” treatment (B(Amylopectin)).

Hi @cduvallet and @Zachary_Bendiks. Yes, I agree with Claire on the pre- and post-samples. If you combine these samples you’re assuming that these samples are drawn from the same population. If you think that this is ok (i.e. time doesn’t matter and samples are more-or-less the same in both groups, or you want to ignore the affects of time) then you could potentially move forward. Just be aware that you’re making a weird choice from a stats point of view.

As far as the # of samples in your control group, Claire is also right that you want a larger number for the transformation to work properly. 12 controls would be pushing the lower limit on what’s useful, and 6 is way too small. With too few samples, you’ll end up with a choppy, 8-bit percentile distribution that won’t be very quantitative.


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.