uneven coverage of sequencing and rarefactions

colinbrislawn · April 5, 2023, 5:21pm

Great! Let's hear what they have to say.

You can do this, but there is going to be a tradeoff between keeping more samples and keeping more data in each sample.

The issue is that samples with few reads have lower resolution. (Just like a photo with fewer pixels has a lower resolution.)

One common method of normalization is subsampling, for example to 10k reads per sample as you mentioned. But what do you do with samples that have fewer than 10k reads? There is no way to add resolution that you do not have, so many normalization pipelines simply drop these samples from the normalized output.
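To make that concrete, here is a minimal sketch of subsampling without replacement in plain NumPy. This is not what any particular pipeline runs internally; the `rarefy` function, the `table` dict, and the sample and count values are all made up for illustration.

```python
import numpy as np

def rarefy(counts, depth, seed=0):
    """Subsample every sample to `depth` reads, without replacement.

    `counts` maps a sample ID to a list of per-feature read counts
    (a hypothetical layout chosen for this sketch). Samples with fewer
    than `depth` reads are dropped, because they cannot be subsampled
    up to that depth.
    """
    rng = np.random.default_rng(seed)
    kept = {}
    for sample_id, row in counts.items():
        row = np.asarray(row)
        if row.sum() < depth:
            continue  # too shallow: dropped from the normalized output
        # Expand counts into one feature index per read, then draw reads.
        reads = np.repeat(np.arange(row.size), row)
        picked = rng.choice(reads, size=depth, replace=False)
        kept[sample_id] = np.bincount(picked, minlength=row.size)
    return kept

# Hypothetical feature table: three samples, three features.
table = {
    "A": [6000, 5000, 4000],   # 15,000 reads
    "B": [3000, 2000, 1000],   #  6,000 reads, below the 10k depth
    "C": [10000, 0, 0],        # exactly 10,000 reads
}
rarefied = rarefy(table, depth=10_000)
print(sorted(rarefied))  # sample "B" is missing from the result
```

After this, every remaining sample has exactly 10k reads, and sample B is gone.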

This is the tradeoff:

- keep all samples, removing resolution from the deeply sequenced samples so all are comparable
- keep just deeply sequenced samples, removing samples that have fewer reads

There is no way to do both.
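One way to look at the tradeoff before picking a depth is to tabulate, for a few candidate depths, how many samples survive and how many reads are left in total. A rough sketch, assuming you already have per-sample read totals; the `tradeoff` helper and the numbers in `read_totals` are hypothetical.

```python
def tradeoff(read_totals, depths):
    """For each candidate depth, report (depth, samples kept, reads kept).

    A sample is kept only if it has at least `depth` reads; every kept
    sample contributes exactly `depth` reads after subsampling.
    """
    rows = []
    for depth in depths:
        n_kept = sum(1 for total in read_totals.values() if total >= depth)
        rows.append((depth, n_kept, depth * n_kept))
    return rows

# Hypothetical per-sample read totals.
read_totals = {"A": 52000, "B": 18000, "C": 9500, "D": 3100}

table_rows = tradeoff(read_totals, [3000, 9000, 18000, 50000])
for depth, n_samples, n_reads in table_rows:
    print(f"depth={depth:>6}  samples kept={n_samples}  reads kept={n_reads}")
```

With these made-up totals, a depth of 3,000 keeps all four samples but only 12,000 reads overall, while a depth of 50,000 keeps 50,000 reads but only one sample.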

For the messy, academic debate about this tradeoff, see these two papers:
Why subsampling is (always!) bad: "Waste not, want not: why rarefying microbiome data is inadmissible" (PubMed)
Why subsampling is (often!) fine: "Normalization and microbial differential abundance strategies depend upon data characteristics" (PMC)