SRS: Alternative tool for library size normalisation

Hi everyone,

I have been using QIIME 2 since 2017, have spent countless days reading the QIIME 2 forum, and highly appreciate the community support! Moreover, I am an advocate of Open Science and, thus, feel at home in the QIIME 2 community. I thought it might be the right time to get a QIIME 2 forum account and do my very first forum post :wave::slightly_smiling_face:

I would like to introduce an algorithm to the QIIME 2 community that we developed for the normalisation of microbiome count data. Ever since the classic paper by McMurdie & Holmes in 2014, we have had statistical proof that rarefying (random subsampling without replacement) should never be used for library size normalization (many researchers still use it, though). It was great to see that QIIME 2 is not using random OTU picking without replacement!

Our new algorithm for the normalisation of microbiome count data is called ‘scaling with ranked subsampling’ (SRS). Here is a simplified illustration of how SRS works:
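In a nutshell, and with made-up toy counts rather than the numbers from the figure, the two steps look roughly like this:

## toy example: a library of 100 counts normalised to Cmin = 30
counts <- c(OTU1 = 70, OTU2 = 20, OTU3 = 7, OTU4 = 3)
Cmin   <- 30
scaled <- counts * Cmin / sum(counts)  # step 1, scaling: 21.0, 6.0, 2.1, 0.9
Cint   <- floor(scaled)                # integer parts:   21, 6, 2, 0 (sum = 29)
Cfrag  <- scaled - Cint                # fractional parts: 0.0, 0.0, 0.1, 0.9
## step 2, ranked subsampling: the one missing count (30 - 29) goes to the OTU with the
## highest fractional part (here OTU4), giving the normalised library 21, 6, 2, 1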

We normalised a bacterial 16S library to different library sizes using rarefying and SRS and compared the Bray-Curtis index of dissimilarity among 10,000 repeats. We found what we already expected from the random OTU picking of rarefying and showed that SRS is highly reproducible:

You can find all the details of SRS here: Improved normalization of species count data in ecology by scaling with ranked subsampling (SRS): application to microbial communities [PeerJ]
An implementation of SRS in R is available here: BonaRes Repository

You will see that SRS also uses random subsampling; however, a complex combination of circumstances has to occur for random subsampling to be used (it rarely happens). Additionally, if random subsampling is used, the relative abundance of the affected OTUs will vary by at most a single count, which is mathematically inevitable. I am aware that our study is not as comprehensive as the study by Weiss et al. 2017; however, I believe that SRS is a suitable and promising tool for the normalisation of microbiome count data. I would be happy to discuss different library size normalisation approaches and the potential advantages/disadvantages of SRS with the QIIME 2 community. Depending on your feedback, I may try to develop a QIIME 2 plugin for SRS :computer:

Cheers and keep safe :v:

Lukas

5 Likes

Thanks for the overview @lukasbeule,

I’m trying to better understand how to examine the plot showing the Bray-Curtis dissimilarity vs. Number of counts. Your paper describes using a library consisting of 60 samples, and I think? you examined 39 different Cmin values. For every one of those distinct Cmin values, you performed 10,000 replications - is that right? So when I look at a given point on the figure, each point represents one of those Cmin values, and the error bars represent the variation observed in the Bray-Curtis distance for each of the 10,000 replicates?

When I first looked at that figure, I said to myself “well then just make sure you have over 100,000 sequences per sample and you’ll be fine even if you rarefy”… but of course you never get perfectly balanced throughput per sample, and I’m pretty sure you illustrate this in Figure 2D in your paper, correct? My takeaway from Figure 2D was that your SRS method avoids the issue of discarding rare variants because of differences in throughput between samples - is that fair?

2 Likes

Hi @devonorourke thanks for your questions! I hope that I can answer all of them.

I’m trying to better understand how to examine the plot showing the Bray-Curtis dissimilarity vs. Number of counts. Your paper describes using a library consisting of 60 samples, and I think? you examined 39 different Cmin values. For every one of those distinct Cmin values, you performed 10,000 replications - is that right? So when I look at a given point on the figure, each point represents one of those Cmin values, and the error bars represent the variation observed in the Bray-Curtis distance for each of the 10,000 replicates?

We sequenced a dataset of 60 samples, processed them in :qiime2:, and selected one sample for our simulation study. We did this to obtain a zero-inflated microbiome sample. And yes, we normalised this sample to 39 different library sizes (Cmin), and each point represents the mean of one Cmin value. Yes, we performed 10,000 replications (which occupied approx. 42 GB RAM and took almost an entire day :grinning:). We then calculated the Bray-Curtis dissimilarity among all 10,000 replications (i.e., the normalised libraries) at each Cmin. The error bars range from the minimum to the maximum Bray-Curtis dissimilarity among all 10,000 replications. For example, for rarefy, some of the 10,000 replications did not share a single OTU and the Bray-Curtis index hit 1. This means that when we run rarefy more than once on the same dataset, we can produce completely different normalised libraries. That's why some people recommend setting a random seed (set.seed() in R) before running rarefy, which I believe is simply a way of getting a reproducible bias.
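If you want to see this behaviour on a toy sample, you can reproduce the effect with vegan's rrarefy (any implementation of random subsampling without replacement behaves the same way):

library(vegan)
## toy zero-inflated sample: a few dominant OTUs plus many singletons
comm <- matrix(c(5000, 3000, 500, rep(1, 200)), nrow = 1)
## rarefy the same sample twice to 1,000 counts without fixing a seed
r1 <- rrarefy(comm, 1000)
r2 <- rrarefy(comm, 1000)
## the two normalised libraries differ, so their Bray-Curtis dissimilarity is > 0
vegdist(rbind(r1, r2), method = "bray")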

When I first looked at that figure, I said to myself “well then just make sure you have over 100,000 sequences per sample and you’ll be fine even if you rarefy”… but of course you never get perfectly balanced throughput per sample, and I’m pretty sure you illustrate this in Figure 2D in your paper, correct? My takeaway from Figure 2D was that your SRS method avoids the issue of discarding rare variants because of differences in throughput between samples - is that fair?

Correct - it depends on the diversity parameter that you're looking at (e.g., Shannon diversity and species richness behave differently).
Here's a little thought experiment: imagine you obtained 1,000 sequence counts for a sample. 500 of these counts belong to species A, and the other 500 counts belong to 250 different species, each with 2 counts. If you now want to normalise to 500 counts, species A should have 250 counts and all the other species should have 1 count (this is actually what SRS would produce in this scenario :wink:). When you use rarefy, however, there is a scenario in which your normalised sample will contain only species A. Statistically, this scenario is highly unlikely; however, it still exists and will eventually happen. Furthermore, it is less about the total number of sequence counts and more about the difference among your samples. For example, if your sequencing depth differs by a factor of 2, the chance of a significant bias using rarefy is fairly low. However, if you normalise across orders of magnitude in sequencing depth (which can happen), you are very likely to end up with a bias using rarefy. This was the initial motivation to develop SRS.
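You can check this thought experiment directly with the SRS R package (columns are samples, rows are species):

library(SRS)
## species A with 500 counts plus 250 species with 2 counts each (1,000 counts in total)
toy <- data.frame(sample1 = c(500, rep(2, 250)))
toy_SRS <- SRS(toy, Cmin = 500)
head(toy_SRS)      # species A keeps 250 counts, every other species keeps exactly 1 count
colSums(toy_SRS)   # exactly 500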

Unfortunately, we did not analyse the discarding of rare variants, but based on our diversity analysis and on the figure that I will show below, SRS on average preserves rare variants better than rarefy. Here's what we did: we multiplied our sample by 100 (counts x 100; referred to as 'artificially raised counts'), which of course did not change our alpha diversity indices. We then normalised our sample with artificially raised counts using rarefy and SRS to different Cmin that were ≥ the counts of the sample without artificially raised counts (Figure S1; see below). Again, everything with 10,000 replications. This time, however, we knew the 'original alpha diversity' since it did not change by artificially raising the counts. In other words, we knew which alpha diversity we should expect (the dashed black line). To make an extremely long story short, the main idea behind SRS was to mimic a lower sequencing depth.

Figure S1
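If you want to repeat this check on one of your own samples, the core of it is only a few lines (a minimal sketch; 'samp' is a placeholder name for a one-column data frame of OTU counts, and diversity() is from vegan):

library(vegan)
library(SRS)
## samp: a one-column data frame of OTU counts for a single sample (placeholder)
raised <- samp * 100                             # artificially raised counts: alpha diversity is unchanged
expected_shannon <- diversity(samp[, 1], index = "shannon")
norm_SRS <- SRS(raised, Cmin = sum(samp[, 1]))   # normalise back down to the original sequencing depth
diversity(norm_SRS[, 1], index = "shannon")      # compare against expected_shannon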

4 Likes

SRS is now available on CRAN!
https://CRAN.R-project.org/package=SRS

Next, I will try to develop a QIIME 2 plugin and will probably ask for community support during development :computer: :qiime2:

3 Likes

Hey @lukasbeule,
I wanted to use SRS with a pile of bat guano :bat: :poop: data I’m about to dive into. I didn’t have any issue in running the R commands, but wanted to check with you about best practices, and things I might want to consider when running this new (to me!) type of data normalization technique.

I’ve posted the R code I applied below. Two quick questions about this kind of strategy:

  1. If I was going to normalize the data via rarefaction, I’d run some sort of rarefaction plot to figure out what the tradeoff would be between losing samples below a minimum read depth, and losing features (ASVs/OTUs) as I increased sampling depths. In the case of SRS, what kind of tradeoff exists? In particular, is there a minimum sampling depth you would recommend not going below?
  2. Once I’ve retained the SRS output, I noticed that the row.names were dropped. Those were the featureIDs, and obviously I need those back :smile:! Am I correct in assuming that the rows are not resorted between the input/output data.frames when the SRS function is applied? In other words, can I just add those row.names back in using the original input file to SRS?

Thanks for your input,
Devon

library(SRS)
library(tidyverse)
library(reshape2)


##import reads
# download.file(destfile = "~/Desktop/nhASV.qza",
#               url = "https://github.com/devonorourke/nhguano/raw/master/data/qiime_qza/ASVtable/sampleOnly_table.qza")

nhguano_qza <- qiime2R::read_qza("~/Desktop/nhASV.qza")
raw_data_wide <- as.data.frame(nhguano_qza$data)
raw_data_wide$OTUid <- rownames(raw_data_wide)
rownames(raw_data_wide) <- NULL
raw_data_long <- melt(raw_data_wide, id = "OTUid") %>% filter(value != 0)
colnames(raw_data_long) <- c("ASVid", "SampleID", "Reads")
rm(nhguano_qza, raw_data_wide)
## which samples contain at least 2000 reads total?
goodSamps <- raw_data_long %>% 
  group_by(SampleID) %>% 
  summarise(SampleSeqDepth = sum(Reads)) %>% 
  filter(SampleSeqDepth >= 2000) %>% 
  select(SampleID) %>% 
  pull()
## retain only those samples:
filt_data_long <- raw_data_long %>% filter(SampleID %in% goodSamps)
## reformat as OTU matrix for SRS:
filt_data_wide <- dcast(data = filt_data_long, formula = ASVid ~ SampleID, value.var='Reads', fill = 0)
row.names(filt_data_wide) <- filt_data_wide$ASVid
filt_data_wide$ASVid <- NULL
filt_data_wide <- as.data.frame(filt_data_wide)

## get Cmin
Cmin <- min(colSums(filt_data_wide))

## run SRS
SRS_output <- SRS(filt_data_wide, Cmin)

1 Like

Hi @devonorourke,
great that you are exploring SRS and thanks for sharing your :bat: :poop: with us!

  1. If I was going to normalize the data via rarefaction, I’d run some sort of rarefaction plot to figure out what the tradeoff would be between losing samples below a minimum read depth, and losing features (ASVs/OTUs) as I increased sampling depths. In the case of SRS, what kind of tradeoff exists? In particular, is there a minimum sampling depth you would recommend not going below?

This question reminds me of a conversation that I had earlier with @vheidrich. He modified the 'rarecurve' function from the 'vegan' package for SRS and dubbed it 'SRScurve'. It plots the species richness against the sample size (Cmin). We have now teamed up :beers: and will soon release an update of the 'SRS' package that includes a finalised version of his function. We will also extend this function to cover not just species richness but also other alpha-diversity indices such as the Shannon index.
Here it's done for the first 3 samples of your dataset using @vheidrich's preliminary 'SRScurve'-function. We also plotted the 'rarecurve'-function for comparison (R code at the end of this message, just add this to your script):


You will immediately notice a zigzag behaviour of the SRS curve :thinking:. This results from the combination of scaling and ranked subsampling of the fractional part (Cfrag). However, the first thing I did, of course, was validate our algorithm: the zigzag is not caused by an error in the SRS algorithm itself. Currently, we don't know of any mathematically justifiable solution to get a steady decay in species richness with decreasing Cmin. However, as we have shown in our SRS paper as well as in the figure above, compared to rarefy, SRS keeps more species, because random subsampling without replacement (even if performed repeatedly) discriminates against species with low count numbers. Therefore, I am OK with species counts that do not decay monotonically :wink:.
I cannot make recommendations regarding minimum sampling depth because it always depends on your data and your goals of analysis. However, what @vheidrich can tell you from his experience is that your minimum sampling depth using SRS is likely to be lower than a reasonable 'rarefaction cutoff' and I believe the figure above illustrates this.

  2. Once I’ve retained the SRS output, I noticed that the row.names were dropped. Those were the featureIDs, and obviously I need those back :smile:! Am I correct in assuming that the rows are not resorted between the input/output data.frames when the SRS function is applied? In other words, can I just add those row.names back in using the original input file to SRS?

Very good point! We did not apply any resorting to the SRS output: you can simply add the original row.names back to the output file. We will explicitly mention this in the updated version of our SRS-package, thanks for your input :slight_smile:!
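Using the object names from your script above, that is a single line:

## re-attach the feature IDs (SRS does not change the row order):
rownames(SRS_output) <- rownames(filt_data_wide)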

Cheers,

Lukas

library(vegan)
#helper function: number of observed species per sample after SRS normalisation to each depth in Cmin
SRSnspec<-function(data,Cmin){
  nspec=as.data.frame(matrix(data=NA,nrow = ncol(data),ncol=length(Cmin)))
  colnames(nspec) <- paste("N", Cmin, sep="")
  rownames(nspec) <- colnames(data)
  for (i in seq(1,ncol(data),1)){
    j=1
    for (sample in Cmin){
      nspec[i,j]=length(which(SRS(data = as.data.frame(data[,i]), Cmin = sample)>0))
      j=j+1
    }
  }
  attr(nspec, "Subsample") <- Cmin
  return(nspec)
}
#plot the number of observed species per sequencing depth/sample size
#it works very much like "rarecurve" function from vegan
SRScurve <- function(x, step = 1, sample, xlab = "Sample Size", ylab = "Species",
           label = TRUE, col, lty, ...) {
    ## matrix is faster than data.frame
    x <- as.matrix(x)
    ## check input data: must be counts
    if (!identical(all.equal(x, round(x)), TRUE))
      stop("function accepts only integers (counts)")
    ## sort out col and lty
    if (missing(col))
      col <- par("col")
    if (missing(lty))
      lty <- par("lty")
    tot <- rowSums(x)
    S <- specnumber(x)
    ## remove empty rows or we fail
    if (any(S <= 0)) {
      message("empty rows removed")
      x <- x[S > 0,, drop =FALSE]
      tot <- tot[S > 0]
      S <- S[S > 0]
    }
    nr <- nrow(x)
    ## rep col and lty to appropriate length
    col <- rep(col, length.out = nr)
    lty <- rep(lty, length.out = nr)
    ## SRS-normalise each sample over a range of depths (adapted from vegan's rarecurve)
    out <- lapply(seq_len(nr), function(i) {
      n <- seq(1, tot[i], by = step)
      if (n[length(n)] != tot[i])
        n <- c(n, tot[i])
      drop(SRSnspec(as.data.frame(x[i,]),Cmin = n)) #using the function declared above
    })
    Nmax <- sapply(out, function(x) max(attr(x, "Subsample")))
    Smax <- sapply(out, max)
    ## set up plot
    plot(c(1, max(Nmax)), c(1, max(Smax)), xlab = xlab, ylab = ylab,
         type = "n", ...)
    ## rarefied richnesses for given 'sample'
    if (!missing(sample)) {
      abline(v = sample)
      rare <- sapply(out, function(z) approx(x = attr(z, "Subsample"), y = z,
                                             xout = sample, rule = 1)$y)
      abline(h = rare, lwd=0.5)
    }
    ## rarefaction curves
    for (ln in seq_along(out)) {
      N <- attr(out[[ln]], "Subsample")
      lines(N, out[[ln]], col = col[ln], lty = lty[ln], ...)
    }
    ## label curves at their endpoints
    if (label) {
      ordilabel(cbind(tot, S), labels=rownames(x), ...)
    }
    return(out)
  }

rarecurve(x = t(filt_data_wide[1:3]), step = 1,col=c("#000000", "#E69F00", "#56B4E9"), las=1, label = F, xlim =c(0,8000))
axis(side=2, at=seq(0,130 ,by=1), labels=NA, tck=-0.015)
par(new=T)
SRScurve(x = t(filt_data_wide[1:3]), step = 1, col = c("#000000", "#E69F00", "#56B4E9"), las=1,label = F, xlim =c(0,8000))
2 Likes

Wonderful stuff - thanks for the detailed reply @lukasbeule,
I would imagine the greatest variance between the SRS and rarefaction values we get will occur with observed richness, as you show in the figure. I’m guessing relative differences for SRS and rarefaction when extending this concept to Shannon’s entropy will be lower? I’m excited to give it a try whenever the new version is available.

If it’s a rewrite of vegan, I wonder if you’re adapting the renyi function? If you went the route of using Hill numbers in your SRS package, it’d provide the user with a very flexible way to explore how diversity estimates change as you vary the weight given to sequence abundances. You can always quickly recover the classic trio, for example: Hill numbers of order 0, 1, and 2 correspond to observed richness, the exponential of Shannon’s entropy, and inverse Simpson diversity. The neat thing about the Hill number approach, in my opinion, is that you’re not stuck with just those integer values. You can use an order of 0.25, or 1.75, for example. Then you get a whole spectrum of plots to worry about :wink:
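For what it's worth, vegan's renyi function already gives that whole spectrum when you set hill = TRUE; for example, on the same three samples from above (vegan expects samples as rows, hence the transpose):

library(vegan)
## Hill numbers of order 0, 0.25, 1, 1.75, and 2 for the first 3 samples
renyi(t(filt_data_wide[1:3]), scales = c(0, 0.25, 1, 1.75, 2), hill = TRUE)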

Keep up the great work!

1 Like

Hi @devonorourke!

The new version of the SRS R package is out. As @lukasbeule mentioned earlier, this new version features the function SRScurve, which builds rarefaction plots for SRS normalized data.

The current SRScurve version works for richness, Shannon, Simpson, and inverse Simpson indices, but we will certainly consider your suggestion on using Hill numbers in the next package update.

Now, let's remake @lukasbeule's last plot (first 3 samples of your dataset) using SRScurve with 'metric=shannon' and compare SRS with rarefying by setting 'rarefy.comparison = T':

Finally, by using the 'sample' parameter you may evaluate the difference in diversity estimates for SRS and rarefy at specific sampling depths. Here is an example with 'metric = richness' and 'sample = 2500':
This anecdotal yet useful example shows that only through SRS normalization would it be possible to capture 100% of the yellow sample's richness while keeping the black sample in the dataset.

You can reproduce those plots with the following R code:

install.packages("SRS")
library(SRS)
#first plot
SRScurve(x = filt_data_wide[1:3], metric = "shannon", step = 100,
rarefy.comparison = T, rarefy.comparison.legend = T, ylab = "Shannon",
col = c(rep("#000000",2), rep("#E69F00",2), rep("#56B4E9",2)), lty = c("solid","longdash"), xlim = c(0,8000))
#second plot
SRScurve(x = filt_data_wide[1:3], metric="richness", step = 100, sample = 2500,
rarefy.comparison = T, rarefy.comparison.legend = T, ylab = "ASVs",
col = c(rep("#000000",2), rep("#E69F00",2), rep("#56B4E9",2)), lty = c("solid","longdash"), xlim = c(0,8000))

Please take a look at the SRScurve documentation for further details and options.

Hope it is useful! Thank you for the comments and suggestions

Cheers,

Vitor

4 Likes

Awesome work @vheidrich and @lukasbeule,

I'm still working on trying to figure out the best way to determine a minimal sampling depth for a large sample set (hundreds of samples). Plotting this with the new SRScurve function is certainly possible, but viewing that many lines on a single plot is a bit like when my 5-year-old daughter decides to show me her latest treasure map... it looks like someone sat on a plate of spaghetti :spaghetti: ...

I wonder if there is a way to interrogate the data in a way that produces a visualization similar to the interactive output of qiime feature-table summarize:

I'm guessing that a Shiny app is the obvious candidate for taking this kind of data in R and making it interactive, but it's been a while since I've tried making interactive plots in R, and perhaps there are better methods already available.

It's tricky - the current interactive framework with QIIME lets you determine how sampling depth changes the absolute number of features retained, but it doesn't give you a sense of how many features are retained per sample (that's what the visualization from alpha-rarefaction does though!). I suppose what I'm thinking of building is a function just like your SRScurve function, but instead of plotting the results visually, I'd be able to retain them in a table format where I could have a button or slide bar to manipulate and select a given sampling depth, and report back how many OTUs (or any other alpha metric) are retained in each sample.
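As a rough sketch of that idea, reusing the SRSnspec() helper posted earlier in this thread (and sticking to depths at or below my 2,000-read filter so SRS never runs out of counts), something like this already gets me the table; the interactive slider would just sit on top of it:

depths <- c(500, 1000, 1500, 2000)
retained <- SRSnspec(filt_data_wide[1:3], Cmin = depths)   # first 3 samples, for speed
retained   # rows = samples, columns = ASVs retained at each candidate depth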

Thanks for your great work and recent advancements with the SRS package!

Cheers

3 Likes

@devonorourke thank you very much for your thoughts and suggestions!
You again gave @vheidrich and me lots of new ideas to think about.

We agree that a shiny app would be great! Currently, we are quite busy with other stuff – my dissertation submission deadline is approaching – but we will soon continue to add new features to the SRS package and hopefully get a q2-plugin for SRS done in the near future. Let’s keep in touch :v:

– Lukas

3 Likes

Hi @lukasbeule,
Could you confirm something for me:
should I expect the same output every time I run SRS without running set.seed() in the R environment first? Just wondering for the sake of reproducibility.
Thank you!

1 Like

Hi @devonorourke!

Great question! Reproducibility is the key idea behind SRS :slight_smile:. You will only get variation among SRS runs if random OTU/ASV picking is used. Let’s start with the long answer first:

“The use of random subsampling in SRS is limited to a fraction of counts that have to be added to the sum of counts scaled and rounded down to integers in order to reach the desired library size. A complex combination of circumstances has to occur for random subsampling to be used in SRS: several OTUs have to share both the integer part and the decimal fraction of their scaled frequencies; in a list of OTUs ranked by frequencies, these OTUs have to appear before the desired total number of counts is reached; and the number of counts that is needed to fill the normalized library is lower than the number of these OTUs. As long as at least one of these conditions is not fulfilled, SRS does not use random subsampling with replacement and replicates of the normalized library are identical. […] If random subsampling is used, the relative abundance of the affected OTUs will vary by at most a single count.” (original SRS publication)

Therefore, the short answer is: it rarely happens, and if it happens, its effect is negligible. Thus, I am not running it with set.seed(). However, it doesn’t hurt to seed the pseudorandom number generator :wink:.

By the way, I was thinking about set.seed() a lot recently. Running rarefying with set.seed() often serves as a justification for random subsampling without replacement. In terms of reproducibility, this is correct; however, in my opinion, this simply enables the reproducibility of the bias that comes with rarefying. This was one of my main motivations for SRS.

1 Like

Thanks @lukasbeule,

Appreciate the detailed reply. In the last sentence from your SRS publication...

... it makes me wonder: is there a downside to adding a default seed within SRS? Would it slow any part of the program? If not, perhaps it further ensures reproducibility without any computational cost? Certainly the user is always able to call set.seed() prior to executing the SRS function, but I'm always appreciative of programs that have a default built in.

Cheers!

2 Likes

Hi @devonorourke,

it wouldn’t slow down the algorithm. For implementation in SRS, @vheidrich and I would simply place set.seed(seed) before the random step within the SRS algorithm and define seed = 1 as the default setting in the SRS function. That way, users are able to adjust the default seed. I will discuss this with my colleagues today, but it seems like we will soon update SRS! Thank you very much for your suggestion @devonorourke. We will keep you updated!

2 Likes

We updated the SRS package.
The new release (version 0.1.2) includes a built-in set.seed in the SRS function.
The updated SRS-function is:


Usage

SRS(data, Cmin, set_seed = TRUE, seed = 1)

Arguments

data: Data frame (species count or OTU table) in which columns are samples and rows are the counts of species or OTUs. Only integers are accepted as data.

Cmin: The number of counts to which all samples will be normalized. Cmin must be lower than or equal to the total OTU count of every sample. Typically, the total OTU count of the sample with the lowest sequencing depth is chosen as Cmin.

set_seed: Logical, if TRUE, a seed is set to enable reproducibility of SRS if OTUs with identical Cfrag as well as Cint are sampled randomly without replacement. See set.seed for details. Default is TRUE.

seed: Integer, specifying the seed. See set.seed for details. Default is 1.
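For example, with the filt_data_wide table and Cmin from the script earlier in this thread:

library(SRS)
SRS_default <- SRS(filt_data_wide, Cmin)               # set_seed = TRUE and seed = 1 by default
SRS_alt     <- SRS(filt_data_wide, Cmin, seed = 42)    # override the seed for a different reproducible run
identical(SRS_default, SRS(filt_data_wide, Cmin))      # TRUE: same seed, same output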


A Shiny app for SRS will very soon be available and the development of a QIIME 2 plugin will soon start!

1 Like

We updated the SRS R package once again :tada:

The current release (version 0.2.1) includes a Shiny app (SRS.shiny.app-function) that can be used for the determination of Cmin.

:stop_sign::construction: don't use version 0.2.0 of the SRS package as it contains a bug that has been fixed in version 0.2.1 :construction::stop_sign:

This is the interface of the Shiny app:

The Shiny app was mainly written by @vheidrich with suggestions by @devonorourke and myself.

The app allows users to choose among four different diversity indices (species richness, Shannon diversity, Simpson index, inverse Simpson index), the sampling depth (Cmin), the step size for SRS curves, as well as the maximum sample size for which SRS curves will be plotted (this saves computation time). We tried to design the app to be as intuitive as possible; however, feel free to contact us (@vheidrich or me) in case you have any questions.
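Launching the app is a one-liner (shown here on the filt_data_wide table from earlier in this thread, assuming SRS.shiny.app() takes the same table format as SRS(), i.e., columns are samples):

library(SRS)
SRS.shiny.app(filt_data_wide)   # opens the Cmin-exploration app in the browser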

Finally, @vheidrich is currently working on a QIIME 2 plugin for SRS that will be available soon.

7 Likes

Hi,

The q2-srs plugin is already available! Take a look at the plugin announcement topic for a brief tutorial on how to use SRS within the QIIME 2 environment.

Cheers,

Vitor

3 Likes