Percentile normalization with perc-normalize

Hi @cduvallet,

I am currently working on a dataset where a group of the samples had to be resequenced, and now when I analyze the samples there seems to be a batch effect between the two sequencing runs. While looking for a way to fix this I came across your posts about the perc-normalize plugin, and I wanted your opinion on whether it could work with my samples. We have roughly 380 oral microbiome samples with about 16 negative sequencing controls. According to the documentation you normally use case-control data for this plugin, but could the normal samples just be “cases” and the sequencing controls be “controls”, or will that not work since the controls will be too low in abundance? Alternatively, could one sequencing run be called a case and the other a control?


Hi @Zach_Burcham!

On whether q2-perc-norm could work for your samples: in theory, yes. If you have a group of samples in both batches which are biologically "the same" (e.g. healthy patients, or samples all treated with condition A), then you should be able to use q2-perc-norm to correct the batch effect.

On treating the sequencing controls as your "controls": again, in theory yes, you could do this, but you're right that it probably won't work. The sequencing controls will likely have very few OTUs in common with the real samples (and if they do, they may be very low abundance), making the percentile normalization meaningless. For example, an OTU which is absent in your sequencing controls but present in your real data will be converted to 1.0 in every real sample that contains it. I also don't think that you should trust the OTU abundances in your negative sequencing controls, as these are (by definition) just noise, right?

On calling one sequencing run the cases and the other the controls: definitely not. The point of the method is to identify a group of samples that can be considered "the same" in both of your batches, use these as an anchor to compare all the other samples to (i.e. use them as the null to normalize all other samples against), and then combine the data across batches. If you consider one run cases and the other controls, you not only fail to identify a comparable group of samples in both batches, but you also no longer have two batches to combine. Does that make sense?
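
To make the anchoring idea concrete, here is a minimal sketch of percentile normalization for a single OTU in plain Python. It is not the q2-perc-norm implementation: the function names, the toy abundances, and the tie-handling rule (ties count as half) are assumptions made for illustration. Each batch is normalized against its own controls, and only then are the batches pooled.

```python
def percentile_of_score(reference, value):
    """Fraction (0.0-1.0) of `reference` lying below `value`; ties count half."""
    below = sum(1 for r in reference if r < value)
    equal = sum(1 for r in reference if r == value)
    return (below + 0.5 * equal) / len(reference)


def percentile_normalize(abundances, is_control):
    """Re-express one OTU's abundances as percentiles of the batch's controls."""
    controls = [a for a, c in zip(abundances, is_control) if c]
    if not controls:
        raise ValueError("each batch needs at least one control sample")
    return [percentile_of_score(controls, a) for a in abundances]


# Hypothetical toy data: the same biology measured in two runs, where run 2
# reads ten times higher. The first four samples in each run are the anchors.
is_control = [True, True, True, True, False, False]
run1 = [5, 8, 6, 7, 9, 4]
run2 = [50, 80, 60, 70, 90, 40]

norm1 = percentile_normalize(run1, is_control)
norm2 = percentile_normalize(run2, is_control)
pooled = norm1 + norm2  # batches are combined only after normalization

# Degenerate anchor: the OTU is absent from every control (as with negative
# sequencing controls), so every real sample that has it saturates at 1.0.
blank_anchor = percentile_normalize(
    [0, 0, 0, 0, 3, 12, 40],
    [True, True, True, True, False, False, False],
)
```

In this toy example `norm1` and `norm2` are identical, so the run-to-run scaling disappears. The last three entries of `blank_anchor` all equal 1.0 even though the raw abundances differ more than tenfold, which is why negative sequencing controls make a poor anchor.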

I think you have three options here:

  1. Just do your analyses and remove any results that could be due to batch. For example, if you're doing beta diversity analyses, only consider comparisons that are within-batch.
  2. If you have a subset of samples that are the same condition in both sequencing runs, then use that as your "controls" to normalize against. But that depends on the exact experiment you're running, if any -- it won't work if all 380 samples are just cross-sectional samples from different people.
  3. You might be able to use batch correction methods that have been developed for other 'omics data. In our paper, we used ComBat and saw that it worked okay to reduce batch effects. I think it mostly shifts the "mean" of the data (I think it assumes that the data in both batches is distributed the same, e.g. has same variance), so it might work depending on what type of batch effect you have. There are other methods as well, though I'm less familiar with them. @seangibbons might have more to say about this as well!
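
Option 1 can be sketched as follows for a beta diversity analysis. This is a plain-Python illustration with hypothetical sample IDs, counts, and batch labels; Bray-Curtis is used only as an example dissimilarity, and in practice the distances would come from whatever pipeline you already use.

```python
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    total = sum(u) + sum(v)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def within_batch_distances(counts, batch):
    """Return {(sample_a, sample_b): distance} for same-batch pairs only."""
    return {
        (a, b): bray_curtis(counts[a], counts[b])
        for a, b in combinations(sorted(counts), 2)
        if batch[a] == batch[b]
    }


# Hypothetical OTU counts (three OTUs) and sequencing-run labels.
counts = {
    "s1": [10, 0, 5],
    "s2": [8, 2, 5],
    "s3": [0, 20, 1],
    "s4": [1, 18, 2],
}
batch = {"s1": "run1", "s2": "run1", "s3": "run2", "s4": "run2"}

distances = within_batch_distances(counts, batch)
```

Only the `("s1", "s2")` and `("s3", "s4")` pairs survive here; the four cross-run pairs, where biology and batch cannot be separated, are dropped before any downstream comparison.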

Hi @cduvallet,

Thank you for the detailed response. That makes a lot more sense now!


I wholeheartedly agree with everything Claire said above. One caveat for running a method like ComBat is that we saw really weird artifacts for low-abundance OTUs (i.e. ComBat does weird stuff to zeros). These taxa become even more batchy after the correction. This can be avoided by filtering out low-abundance taxa before running your tests. In addition to trying to batch-correct the underlying data prior to running a statistical test, you could also use a statistical test that allows for inclusion of ‘batch’ as a variable in the test.
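
One simple way to account for batch inside the test itself is a permutation test that shuffles case/control labels only within each sequencing run. The sketch below is an assumption about what such a test could look like, not a method named in this thread (a linear model with a batch covariate would be another route); the function name, toy data, and number of permutations are all hypothetical.

```python
import random


def stratified_permutation_test(values, labels, batches, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a case-vs-control mean difference,
    with labels shuffled only within each batch."""
    groups = {}
    for v, lab, b in zip(values, labels, batches):
        vals, labs = groups.setdefault(b, ([], []))
        vals.append(v)
        labs.append(lab)
    for b, (_, labs) in groups.items():
        if "case" not in labs or "control" not in labs:
            raise ValueError(f"batch {b!r} needs both cases and controls")

    def statistic(label_sets):
        # Average the within-batch (case mean - control mean) differences.
        diffs = []
        for (vals, _), labs in zip(groups.values(), label_sets):
            cases = [v for v, lab in zip(vals, labs) if lab == "case"]
            ctrls = [v for v, lab in zip(vals, labs) if lab == "control"]
            diffs.append(sum(cases) / len(cases) - sum(ctrls) / len(ctrls))
        return sum(diffs) / len(diffs)

    observed = statistic([labs for _, labs in groups.values()])
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # rng.sample with the full length returns a shuffled copy of the labels.
        shuffled = [rng.sample(labs, len(labs)) for _, labs in groups.values()]
        if abs(statistic(shuffled)) >= abs(observed) - 1e-12:
            hits += 1
    # Add-one correction keeps the p-value strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical abundances of one taxon: run2 reads about 100 units higher
# than run1, but cases exceed controls by the same amount in both runs.
values = [10, 11, 12, 13, 1, 2, 3, 4, 110, 111, 112, 113, 101, 102, 103, 104]
labels = (["case"] * 4 + ["control"] * 4) * 2
batches = ["run1"] * 8 + ["run2"] * 8

observed, p_value = stratified_permutation_test(values, labels, batches)
```

Because labels never move between runs, a shift that affects a whole run equally cannot masquerade as a case/control difference. Low-abundance taxa can still be filtered out beforehand, as suggested above.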

