diversity metrics and batch effects

yipinto · December 3, 2019, 9:19pm

I have a project with several sequencing runs. When looking at diversity metrics, it seems that I have batch effects. I found the q2-perc-norm plugin, but I'm not sure if it is safe to use percentile-normalized "counts" for diversity metrics. I was thinking of filtering batch-specific features (with its pros and cons). Any other ideas for dealing with batch effects?
Thanks!

Nicholas_Bokulich · December 4, 2019, 4:46pm

Hi @yipinto,

I think that approach is all cons, unless the features you are filtering are clearly contaminants, e.g., known reagent contaminants or other features that you can positively say will not be found in your samples.

cc: @cduvallet

See here for some other ideas:

cduvallet · December 5, 2019, 4:04am

Hi @yipinto!

The output of percentile-normalization is definitely not "counts", and should not be used in metrics that require counts. For example, Chao1 uses the number of singletons and doubletons to calculate alpha diversity, and so would be inappropriate for use with percentile-normalized data.
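
To make the Chao1 point concrete, here is a minimal sketch in plain Python. It assumes the bias-corrected form of Chao1, S_obs + F1(F1 - 1) / (2(F2 + 1)), and the `percentile_like` values are made up for illustration (they are not real q2-perc-norm output).

```python
def chao1(values):
    """Bias-corrected Chao1: S_obs + F1*(F1-1) / (2*(F2+1)).

    F1 = number of features seen exactly once (singletons),
    F2 = number of features seen exactly twice (doubletons).
    """
    observed = [v for v in values if v > 0]
    s_obs = len(observed)
    f1 = sum(1 for v in observed if v == 1)
    f2 = sum(1 for v in observed if v == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Real counts: three singletons and one doubleton drive the estimate.
raw_counts = [10, 1, 1, 1, 2, 0, 0, 5]

# Hypothetical percentile-style values: no exact 1s or 2s, and no zeros.
percentile_like = [55.0, 12.5, 30.1, 71.2, 88.0, 3e-10, 7e-10, 40.0]

print(chao1(raw_counts))       # uses singleton/doubleton information
print(chao1(percentile_like))  # collapses to the number of features
```

With non-count input there are no singletons or doubletons left, so the estimator just returns the number of non-zero features and no longer estimates unseen richness.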

Unfortunately, we haven't gone through and identified which of the many metrics available are appropriate to use with percentile-normalized data, so you'll have to go through and see how each one is calculated and whether they make assumptions about the data that might not be applicable here.

I think one of the most important things to note with the percentile-normalized output is that we add random noise to the zeros (to prevent pile-up of ranks, see more here), so any metric that uses zero as a meaningful value will not work for percentile-normalized data.
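
A small sketch of why the zero replacement matters, using Jaccard distance on presence/absence as the example. The `replace_zeros` helper and its `upper=1e-9` bound are assumptions made for illustration, not the plugin's actual code.

```python
import random


def replace_zeros(values, rng, upper=1e-9):
    """Swap each zero for a tiny random positive number (illustrative only)."""
    return [v if v != 0 else rng.uniform(0.0, upper) for v in values]


def jaccard_distance(a, b):
    """Jaccard distance on presence/absence, where 'present' means value > 0."""
    present_a = {i for i, v in enumerate(a) if v > 0}
    present_b = {i for i, v in enumerate(b) if v > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Two samples that share no features at all.
sample_a = [5, 0, 3, 0]
sample_b = [0, 2, 0, 7]

before = jaccard_distance(sample_a, sample_b)
after = jaccard_distance(replace_zeros(sample_a, rng),
                         replace_zeros(sample_b, rng))

print(before)  # completely dissimilar on the raw counts
print(after)   # every feature now looks "present" in both samples
```

Once zeros are replaced, every feature is nominally present in every sample, so any presence/absence metric stops carrying information.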

If you have questions about any specific metrics that you've thought about but can't figure out, feel free to post again and we'll see if we can figure it out together (make sure to tag me and/or @seangibbons so we see the post).

Another option to deal with batch effects (beyond the ones suggested in the post linked by @Nicholas_Bokulich) is to run your analyses on a per-batch basis and then compare results across batches (e.g., calculating beta diversity only on samples within each batch, and excluding all cross-batch calculations).
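
As a rough sketch of the per-batch idea, the snippet below computes Bray-Curtis dissimilarity only for pairs of samples from the same batch. The sample IDs, counts, and batch labels are hypothetical, and in practice you would split your feature table by run and use your usual diversity tooling rather than this hand-rolled function.

```python
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def within_batch_distances(samples, batches):
    """Return {batch: {(id1, id2): distance}}, skipping cross-batch pairs."""
    out = {}
    for s1, s2 in combinations(sorted(samples), 2):
        if batches[s1] == batches[s2]:
            dist = bray_curtis(samples[s1], samples[s2])
            out.setdefault(batches[s1], {})[(s1, s2)] = dist
    return out


# Hypothetical feature counts (same feature order in every sample).
samples = {
    "s1": [10, 0, 5],
    "s2": [8, 2, 5],
    "s3": [0, 7, 3],
    "s4": [1, 6, 3],
}
# Hypothetical sequencing-run labels.
batches = {"s1": "run1", "s2": "run1", "s3": "run2", "s4": "run2"}

distances = within_batch_distances(samples, batches)
for batch, pairs in distances.items():
    print(batch, pairs)
```

Each batch then gets its own set of distances, and downstream tests can be run per batch and compared for consistency.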

seangibbons · December 5, 2019, 5:14pm

I 100% echo what @cduvallet said here. You probably shouldn't use percentile-normalized data to calculate alpha- (or even beta-) diversity metrics without putting a lot of thought into your interpretation of the data. Replacing zeros with random pseudo-counts is one issue (metrics like Jaccard don't make sense anymore because there are no longer zeros/absences). Another issue is that any relative abundance information within a sample is lost (i.e., the counts for each taxon are normalized to the control distribution for that taxon across samples). Thus, all percentile-transformed 'abundances' are numbers between 0 and 100, and there's no longer any way to tell which taxon is more or less abundant within a given sample. So, if this within-sample relative abundance information is necessary for the diversity metric you are calculating, then you probably shouldn't use percentile-normalized output for that calculation.
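
Here is a simplified sketch of the loss of within-sample abundance information. It converts each taxon's value to a percentile of that taxon's control distribution, using a "mean" percentile-of-score definition as an assumption; the taxa names and counts are invented for illustration, and this is not the plugin's implementation.

```python
def percentile_of_score(reference, x):
    """Percentile (0-100) of x within reference, averaging the strict and
    weak definitions ("mean" style) to handle ties."""
    n = len(reference)
    less = sum(1 for r in reference if r < x)
    less_equal = sum(1 for r in reference if r <= x)
    return (less + less_equal) / (2 * n) * 100


# Hypothetical control samples: taxon_A is ~100x more abundant than taxon_B.
controls = {
    "taxon_A": [100, 200, 300, 400],
    "taxon_B": [1, 2, 3, 4],
}

# One case sample, where taxon_A clearly dominates taxon_B in raw counts.
case_sample = {"taxon_A": 300, "taxon_B": 3}

normalized = {
    taxon: percentile_of_score(controls[taxon], value)
    for taxon, value in case_sample.items()
}
print(normalized)
```

Both taxa end up with the same percentile even though one is 100 times more abundant in the raw counts, because each value is only ranked against its own taxon's controls.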

The original application of percentile normalization is differential abundance testing on a taxon-by-taxon basis (i.e. is 'taxon X' enriched/depleted in cases vs. controls). It works well in this application, but extreme caution is advised for other use cases.

jwdebelius · July 11, 2020, 6:33pm

A post was split to a new topic: Is Rarified Data appropriate for UniFrac?