Correct interpretation of library size in amplicon/metagenomic sequencing

Hi QIIME 2 community,

I have a general question about the interpretation of library size (sequencing depth), because I am puzzled by what may be a misinterpretation that is common in the microbiome literature.

To keep it simple, let's define library size as the sequencing depth obtained for each sample in a 16S amplicon sequencing project using the standard EMP protocol. My understanding is that, because of the library preparation protocol (the libraries for individual samples are normalized to the same molar amount prior to being pooled and sequenced), the resulting library size should not carry any biologically meaningful information about the original sample. It is true that library sizes vary from sample to sample, sometimes greatly and sometimes only slightly, but my understanding is that any variation in library size reflects only the pooling or sequencing process.
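To put toy numbers on my understanding (a hypothetical Python sketch with made-up biomass values, not based on any real protocol or dataset): after equimolar pooling, the read count each sample receives depends only on pooling/sequencing noise, not on the original biomass.

```python
import random

random.seed(0)

# Hypothetical absolute microbial loads (cells per sample): a 100-fold difference.
biomass = {"sample_A": 1e4, "sample_B": 1e6}

# Equimolar pooling: each library is normalized to the same molar input,
# so every sample contributes roughly the same fraction of the pool.
pooled_fraction = {s: 1 / len(biomass) for s in biomass}

# The sequencing run yields a fixed total number of reads; each sample's
# library size is a draw around its pooled fraction (pipetting/sequencing noise).
total_reads = 100_000
library_size = {
    s: int(total_reads * pooled_fraction[s] * random.uniform(0.8, 1.2))
    for s in biomass
}

# Both samples end up with roughly 50,000 reads despite the 100-fold
# biomass difference: library size reflects pooling/sequencing, not biomass.
print(library_size)
```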

However, when reading the literature, and occasionally in discussions with collaborators, I have found that library size is often used as a way to measure the "biomass load" of the original samples. For example, in the ANCOM-BC paper, the authors appear to use library sizes to estimate the "sampling fraction", which, as I understand it, is the ratio of the observed counts to the unobserved total absolute abundance of the ecosystem (I do think sampling fraction is a really important aspect that we don't discuss often). They showed that ANCOM-BC normalizes the data well and minimizes the bias due to differing sampling fractions. I am not sure the idea of "bias correction" using library sizes is valid here.
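To make the sampling-fraction idea concrete (a toy Python example with made-up numbers, framed the way I read the ANCOM-BC definition, not taken from the paper):

```python
# Hypothetical true microbial loads of two ecosystems, and the reads obtained.
total_abundance = {"sample_A": 1e8, "sample_B": 1e9}
library_size = {"sample_A": 50_000, "sample_B": 50_000}

# Sampling fraction: observed counts relative to the (unobserved) total
# absolute abundance of the ecosystem.
sampling_fraction = {
    s: library_size[s] / total_abundance[s] for s in total_abundance
}

# sample_A: 5e-04, sample_B: 5e-05 -- a 10-fold difference in sampling
# fraction even though the library sizes are identical, which is exactly
# why library size alone cannot recover it.
print(sampling_fraction)
```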

It is also worth noting that numerous studies (e.g., Nearing et al. and Weiss et al.) have shown that library size, and especially its variability, does have an impact on downstream analyses.

Thank you!


Hi @wangj50 ,

You are absolutely correct.

No, library size cannot and should not be used as a proxy for biomass in amplicon or metagenome sequencing (at least when standard protocols are used). Hence the growing body of literature on ways to estimate absolute abundances (e.g., by integrating these data with qPCR, flow cytometry, or internal standards).

As you say, equimolar pooling and similar steps are done in most cases to balance the inputs, so differences in sequencing read depth across samples are due to other downstream issues and do not relate to biomass. Even without equimolar pooling, there are too many other factors that impact the number of observations (e.g., amplification efficiency, DNA extraction efficiency and capacity).

Your points about sampling fraction in ANCOM-BC and about library-size effects on differential abundance testing, diversity measurements, etc., are valid. Yes, sampling depth affects all of these and must be controlled for, but that is a matter of statistical bias and is distinct from the absolute abundance of the inputs.
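To illustrate the depth effect on diversity (a toy Python sketch with a made-up community, just to show the statistical bias, not real data): observed richness from the very same community depends heavily on how many reads you draw.

```python
import random

random.seed(1)

# Hypothetical community: 200 taxa with a skewed abundance distribution
# (each taxon's copy number cycles through 1, 2, 4, ..., 128).
community = [taxon for taxon in range(200) for _ in range(2 ** (taxon % 8))]

def observed_richness(depth):
    """Number of distinct taxa seen in a random subsample of `depth` reads."""
    return len(set(random.choices(community, k=depth)))

shallow = observed_richness(100)
deep = observed_richness(10_000)

# Same community, two library sizes: the shallower sample "sees" far fewer
# taxa. The difference is purely statistical (depth-driven), not biological.
print(shallow, deep)
```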

I hope that helps!


Thanks, Nick!

Glad we are on the same page, and hopefully this thread can serve as a reference point when people have this kind of discussion.

Regarding ANCOM-BC: in my opinion, although the point about sampling fraction is valid, framing the issue as estimating the sampling fraction from library size is incorrect, given the discussion above. Library size does not reflect how the original environment/library was sampled.

However, I think what is described in the ANCOM-II paper (for whatever reason, they are two different methods) is somewhat valid: there, library size is used as a random-effect factor to adjust the log-ratio-transformed counts under a chosen reference frame. And the adjustment method used in ANCOM-II looks very similar to the "bias correction" method in ANCOM-BC.
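A minimal sketch of the intuition behind log-ratio adjustments (my own toy Python example using the centered log-ratio transform, not code from either paper, and the numbers are made up): a per-sample multiplicative factor, whether you call it a sampling fraction or a library-size effect, is additive on the log scale and identical for every taxon, so it cancels under a log-ratio transform.

```python
import math

# Toy true abundances for 3 taxa in one sample (hypothetical numbers).
true_abundance = [100.0, 400.0, 500.0]

# Observed counts are the true abundances scaled by an unknown per-sample
# factor (sampling fraction times sequencing effort).
factor = 0.003
observed = [factor * a for a in true_abundance]

def clr(xs):
    """Centered log-ratio transform: log(x) minus the mean of the logs."""
    logs = [math.log(x) for x in xs]
    mean = sum(logs) / len(logs)
    return [value - mean for value in logs]

# The scaling factor shifts every log-count by the same constant, so it
# drops out of the CLR: both vectors come out identical.
print(clr(true_abundance))
print(clr(observed))
```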
