SRS: Alternative tool for library size normalisation

Hi everyone,

I have been using QIIME 2 since 2017, have spent countless days reading the QIIME 2 forum, and highly appreciate the community support! Moreover, I am an advocate of Open Science and, thus, feel at home in the QIIME 2 community. I thought it might be the right time to get a QIIME 2 forum account and do my very first forum post :wave::slightly_smiling_face:

I would like to introduce an algorithm to the QIIME 2 community that we developed for the normalisation of microbiome count data. Ever since the classic paper by McMurdie & Holmes in 2014, we have had statistical proof that rarefying (random subsampling without replacement) should never be used for library size normalisation (although many researchers still use it). It was great to see that QIIME 2 is not using random OTU picking without replacement!

Our new algorithm for the normalisation of microbiome count data is called ‘scaling with ranked subsampling’ (SRS). Here is a simplified illustration of how SRS works:
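In code, the two steps look roughly like this (a simplified sketch for a single sample, not our actual implementation; the tie-breaking rules for equal fractional ranks, which include the rare random subsampling, are left out here):

```r
# Simplified sketch of the SRS idea for a single sample (tie-breaking rules omitted)
srs_sketch <- function(counts, Cmin) {
  scaled <- counts * Cmin / sum(counts)  # step 1: scaling, so that sum(scaled) == Cmin
  kept <- floor(scaled)                  # step 2a: keep the integer parts
  remaining <- Cmin - sum(kept)          # counts still needed to reach Cmin
  frac <- scaled - kept                  # step 2b: rank the fractional parts and
  top <- order(frac, decreasing = TRUE)[seq_len(remaining)]
  kept[top] <- kept[top] + 1             #          fill up the highest-ranked OTUs
  kept
}

# toy example: three OTUs with 12, 7 and 3 counts, normalised to Cmin = 10
srs_sketch(c(OTU_1 = 12, OTU_2 = 7, OTU_3 = 3), Cmin = 10)
# gives OTU_1 = 6, OTU_2 = 3, OTU_3 = 1 (OTU_1 has the largest fractional part)
```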

We normalised a bacterial 16S library to different library sizes using rarefying and SRS and compared the Bray-Curtis dissimilarity among 10,000 repeats. We found the variability we already expected from the random OTU picking of rarefying and showed that SRS is highly reproducible:

You can find all the details of SRS here: https://doi.org/10.7717/peerj.9593
An implementation of SRS in R is available here: https://doi.org/10.20387/BONARES-2657-1NP3

You will see that SRS also uses random subsampling; however, a complex combination of circumstances has to occur before random subsampling is used (it rarely happens). Additionally, if random subsampling is used, the relative abundance of the affected OTUs will vary by at most a single count, which is mathematically inevitable. I am aware that our study is not as comprehensive as the study by Weiss et al. 2017; however, I believe that SRS is a suitable and promising tool for the normalisation of microbiome count data. I would be happy to discuss different library size normalisation approaches and the potential advantages/disadvantages of SRS with the QIIME 2 community. Depending on your feedback, I may try to develop a QIIME 2 plugin for SRS :computer:

Cheers and keep safe :v:

Lukas


Thanks for the overview @lukasbeule,

I’m trying to better understand how to examine the plot showing the Bray-Curtis dissimilarity vs. Number of counts. Your paper describes using a library consisting of 60 samples, and I think? you examined 39 different Cmin values. For every one of those distinct Cmin values, you performed 10,000 replications - is that right? So when I look at a given point on the figure, each point represents one of those Cmin values, and the error bars represent the variation observed in the Bray-Curtis distance for each of the 10,000 replicates?

When I first looked at that figure, I said to myself “well then just make sure you have over 100,000 sequences per sample and you’ll be fine even if you rarefy”… but of course you never get perfectly balanced throughput per sample, and I’m pretty sure you illustrate this in Figure 2D in your paper, correct? My takeaway from Figure 2D was that your SRS method avoids the issue of discarding rare variants because of differences in throughput between samples - is that fair?


Hi @devonorourke thanks for your questions! I hope that I can answer all of them.

I’m trying to better understand how to examine the plot showing the Bray-Curtis dissimilarity vs. Number of counts. Your paper describes using a library consisting of 60 samples, and I think? you examined 39 different Cmin values. For every one of those distinct Cmin values, you performed 10,000 replications - is that right? So when I look at a given point on the figure, each point represents one of those Cmin values, and the error bars represent the variation observed in the Bray-Curtis distance for each of the 10,000 replicates?

We sequenced a dataset of 60 samples, processed them in :qiime2:, and selected one sample for our simulation study. We did this to obtain a zero-inflated microbiome sample. And yes, we normalised this sample to 39 different library sizes (Cmin), and each point represents the mean of one Cmin value. Yes, we performed 10,000 replications (which occupied approx. 42 GB RAM and took almost an entire day :grinning:). We then calculated the Bray-Curtis dissimilarity among all 10,000 replications (i.e., the normalised libraries) at each Cmin. The error bars range from the minimum to the maximum Bray-Curtis dissimilarity among all 10,000 replications. For example, for rarefy, some of the 10,000 replications did not share a single OTU and the Bray-Curtis index hit 1. This means that running rarefy more than once on the same dataset can produce completely different normalised libraries. That’s why some people recommend setting a random seed (set.seed() in R) before running rarefy, which I believe is simply a way of getting a reproducible bias.
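If you want to try this kind of check yourself, here is a scaled-down sketch of the idea (not our actual simulation code; it uses vegan’s rrarefy() and vegdist(), a made-up toy sample, and far fewer replications):

```r
library(vegan)  # for rrarefy() and vegdist()

# rarefy the same sample n_rep times and measure how different the results are
rarefy_repeatability <- function(counts, Cmin, n_rep = 100) {
  reps <- t(replicate(n_rep, as.vector(rrarefy(matrix(counts, nrow = 1), Cmin))))
  bc <- vegdist(reps, method = "bray")  # Bray-Curtis among all pairs of replications
  c(mean = mean(bc), min = min(bc), max = max(bc))
}

# toy zero-inflated sample, normalised to Cmin = 100
set.seed(1)  # only for a reproducible toy sample
toy <- c(rpois(50, 20), rep(0, 200), rpois(50, 1))
rarefy_repeatability(toy, Cmin = 100)
```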

When I first looked at that figure, I said to myself “well then just make sure you have over 100,000 sequences per sample and you’ll be fine even if you rarefy”… but of course you never get perfectly balanced throughput per sample, and I’m pretty sure you illustrate this in Figure 2D in your paper, correct? My takeaway from Figure 2D was that your SRS method avoids the issue of discarding rare variants because of differences in throughput between samples - is that fair?

Correct - it depends on the diversity parameter that you’re looking at (e.g., Shannon div. and species richness behave differently).
Here’s a little thought experiment: imagine you obtained 1,000 sequence counts for a sample. 500 of these counts belong to species A, and the other 500 counts belong to 250 different species, each with 2 counts. If you now want to normalise to 500 counts, species A should have 250 counts and all the other species should have 1 count (this is actually what SRS would produce in this scenario :wink:). When you use rarefy, however, there is a scenario in which your normalised sample will contain only species A. Statistically, this scenario is highly unlikely, but it still exists and will eventually happen. Furthermore, it is less about the total number of sequence counts and more about the differences among your samples. For example, if your sequencing depth differs by a factor of 2, the chance of a significant bias using rarefy is fairly low. However, if you normalise across orders of magnitude in sequencing depth (which can happen), you are very likely to end up with a bias using rarefy. This was the initial motivation to develop SRS.
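Here is the thought experiment in code, if anyone wants to play with it (a sketch using vegan’s rrarefy() for rarefying; the numbers are exactly those from the example above):

```r
library(vegan)  # for rrarefy()

# one sample: species A with 500 counts plus 250 species with 2 counts each
comm <- matrix(c(500, rep(2, 250)), nrow = 1,
               dimnames = list("sample1", c("A", paste0("sp", 1:250))))

# rarefy to 500 counts many times and record the species richness each time
set.seed(42)  # a seed only makes the random bias reproducible, it does not remove it
richness <- replicate(1000, sum(rrarefy(comm, 500) > 0))
summary(richness)  # richness varies from run to run

# the deterministic outcome described above (and what SRS produces here):
# species A with 250 counts and every other species with 1 count, i.e. richness 251
```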

Unfortunately, we did not analyse the discarding of rare variants directly, but based on our diversity analysis and on the figure shown below, SRS on average preserves rare variants better than rarefy. Here’s what we did: we multiplied our sample by 100 (counts x 100; referred to as ‘artificially raised counts’), which of course did not change our alpha diversity indices. We then normalised the sample with artificially raised counts using rarefy and SRS to different Cmin values that were ≥ the total counts of the sample without artificially raised counts (Figure S1; see below). Again, everything with 10,000 replications. This time, however, we knew the ‘original alpha diversity’, since it did not change by artificially raising the counts. In other words, we knew which alpha diversity we should expect (the dashed black line). To make an extremely long story short, the main idea behind SRS was to mimic a lower sequencing depth.

Figure S1
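A scaled-down sketch of this check (not our actual simulation code; it uses vegan’s rrarefy() and diversity() on a made-up toy sample with far fewer replications, and the same comparison would then be repeated with SRS):

```r
library(vegan)  # for rrarefy() and diversity()

set.seed(1)  # only for a reproducible toy sample
toy <- c(rpois(50, 20), rep(0, 200), rpois(50, 1))  # toy zero-inflated sample
raised <- toy * 100                                 # counts x 100: alpha diversity is unchanged

diversity(toy, index = "shannon")  # the 'original' Shannon diversity we should recover

# normalise the artificially raised counts back to the original library size with rarefy
shannon_rarefy <- replicate(
  100,
  diversity(as.vector(rrarefy(matrix(raised, nrow = 1), sum(toy))), index = "shannon")
)
summary(shannon_rarefy)  # compare to the expected Shannon diversity above
```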


SRS is now available on CRAN!
https://CRAN.R-project.org/package=SRS
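A minimal usage example (the toy OTU table below is made up; please check the package documentation for all arguments):

```r
install.packages("SRS")
library(SRS)

# toy OTU table: rows are OTUs/ASVs, columns are samples (made-up counts)
otu_table <- data.frame(sample1 = c(150, 30, 20, 0),
                        sample2 = c(300, 80, 15, 5))

# normalise all samples to Cmin = 200 counts
SRS(otu_table, Cmin = 200)
```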

Next, I will try to develop a QIIME 2 plugin and will probably ask for community support during development :computer: :qiime2:
