This is my first time posting but these forums have been an immense help with my sequencing analysis.
For context, I am working with two sets of sequences from soil DNA extractions taken at two time points. Both runs used the same protocol, but one was sequenced on its own while the other shared a run with several other projects. As a result, the two runs have very different sequencing depths, and my per-sample sequence counts differ drastically. Looking at my feature table visualizations:
Sequencing run 1 has a min feature frequency of ~50,000
Sequencing run 2 has a min feature frequency of ~700,000
I have a low sample count, so I want to keep as many samples as I can. If I normalize everything to the minimum sequence count of my first sequencing run, I will be losing out on hundreds of thousands of reads' worth of information per sample in my second run.
My question is: Can I analyze these together? I figure the run with more features will automatically be capturing more diversity which will affect my analysis. In hindsight, of course both of these sample sets should have been sequenced together.
I think what I would do to set your mind at ease about the depth difference is run alpha rarefaction on run 2 and see how much diversity you actually lose as you get closer to the depth of run 1. If the curve plateaus early, you may not be losing much information by rarefying. But that really only matters if you want alpha diversity; for other analysis methods, you can avoid rarefying:
Compositional approaches are insensitive to sampling depth (apart from some bias introduced by the pseudocount). They are usually used for differential abundance, but the Aitchison distance gives you a compositional option for beta diversity.
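As a toy illustration of why (made-up counts, plain NumPy rather than any QIIME 2 plugin): the Aitchison distance is just the Euclidean distance after a centered log-ratio (CLR) transform, and a pure depth difference barely moves it, while a genuine compositional difference does.

```python
import numpy as np

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform; the pseudocount sidesteps log(0)."""
    x = counts + pseudocount
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

# Hypothetical samples: `shallow` and `deep` share a composition at very
# different depths; `other` has a genuinely different composition.
shallow = np.array([[  50,   30,   15,   5]])
deep    = np.array([[5000, 3000, 1500, 500]])
other   = np.array([[   5,   15,   30,  50]])

# Aitchison distance = Euclidean distance between CLR-transformed samples.
d_same = np.linalg.norm(clr(shallow) - clr(deep))
d_diff = np.linalg.norm(clr(shallow) - clr(other))
print(d_same, d_diff)  # depth barely matters; composition dominates
```

The small residual distance between `shallow` and `deep` is the pseudocount bias mentioned above; it shrinks as counts grow.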
If you don't want to rarefy at all but still need alpha diversity, you could look at DivNet, which should turn the lower read depth into a wider confidence interval rather than discarded reads.
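Circling back to the rarefaction check: here is a minimal sketch of how a rarefaction curve is built, with made-up counts and plain NumPy (in practice you would run qiime diversity alpha-rarefaction on your feature table and read the plot).

```python
import numpy as np

rng = np.random.default_rng(0)

def rarefaction_curve(counts, depths, n_iter=10):
    """Mean number of observed features at each subsampling depth."""
    # Expand the count vector into one feature index per read.
    reads = np.repeat(np.arange(counts.size), counts)
    means = []
    for d in depths:
        obs = [np.unique(rng.choice(reads, size=d, replace=False)).size
               for _ in range(n_iter)]
        means.append(float(np.mean(obs)))
    return means

# Toy count vector standing in for one deep (run 2) sample.
counts = rng.integers(0, 200, size=300)
depths = [1_000, 5_000, 10_000, 20_000]
curve = rarefaction_curve(counts, depths)
print(dict(zip(depths, curve)))
```

If the observed-feature count stops growing well below your run-1 minimum (~50,000 reads), rarefying run 2 down to that depth discards few features you haven't already seen.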
"To investigate how prevalence estimates were affected by sequencing depth, we focused on four major environment types for which we had the greatest number of samples with more than 50,000 observations."
So their high-depth samples have a minimum sequencing depth the same as your low-depth samples.
I wouldn't worry!
"I will be losing out on hundreds of thousands of features worth of information in my second run."
Bootstrapping uses all the data, and there's a new plugin that does just that!
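I can't speak to the plugin's internals, but the core idea can be sketched like this: repeatedly subsample each sample to a common depth, compute the metric each time, and average, so that every read has a chance to contribute (toy data, plain NumPy, Shannon diversity as the example metric).

```python
import numpy as np

rng = np.random.default_rng(0)

def shannon(counts):
    """Shannon diversity (natural log) of a feature-count vector."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def bootstrapped_shannon(counts, depth, n_boot=100):
    """Average Shannon diversity over repeated subsamples at a common depth."""
    # Expand the count vector into one feature index per read.
    reads = np.repeat(np.arange(counts.size), counts)
    ests = []
    for _ in range(n_boot):
        sub = rng.choice(reads, size=depth, replace=False)
        ests.append(shannon(np.bincount(sub, minlength=counts.size)))
    return float(np.mean(ests)), float(np.std(ests))

deep_sample = rng.integers(0, 500, size=200)  # hypothetical deep sample
mean, sd = bootstrapped_shannon(deep_sample, depth=10_000)
print(f"Shannon ~= {mean:.3f} +/- {sd:.3f}")
```

Unlike a single rarefaction draw, the average over many draws is not hostage to one unlucky subsample, and the spread gives you a sense of how much the depth reduction actually costs.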