Hi @devonorourke thanks for your questions! I hope that I can answer all of them.
I’m trying to better understand how to examine the plot showing the Bray-Curtis dissimilarity vs. number of counts. Your paper describes using a library consisting of 60 samples, and I think you examined 39 different Cmin values. For every one of those distinct Cmin values, you performed 10,000 replications - is that right? So when I look at a given point on the figure, each point represents one of those Cmin values, and the error bars represent the variation observed in the Bray-Curtis distance across the 10,000 replicates?
We sequenced a dataset of 60 samples, processed them, and selected one sample for our simulation study. We did this to obtain a zero-inflated microbiome sample. And yes, we normalised this sample to 39 different library sizes (Cmin), and each point represents the mean of one Cmin value. Yes, we performed 10,000 replications (which occupied approx. 42 GB RAM and took almost an entire day). We then calculated the Bray-Curtis dissimilarity among all 10,000 replications (i.e., the normalised libraries) at each Cmin. The error bars range from the minimum to the maximum Bray-Curtis dissimilarity among all 10,000 replications. For example, with rarefy, some of the 10,000 replications did not share a single OTU, so the Bray-Curtis index hit 1. This means that when we run rarefy more than once on the same dataset, we can produce completely different normalised libraries. That’s why some people recommend setting a random seed (set.seed() in R) before running rarefy, which I believe is simply a way of getting a reproducible bias.
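To make the replicate-to-replicate variability concrete, here is a minimal stdlib-Python sketch (not the code used in the paper; the toy sample and Cmin are made up): it rarefies the same zero-inflated sample twice without a fixed seed and computes the Bray-Curtis dissimilarity between the two replicates.

```python
import random

def rarefy(counts, cmin, rng):
    """Classic rarefying: subsample `cmin` reads without replacement."""
    pool = [sp for sp, n in enumerate(counts) for _ in range(n)]
    out = [0] * len(counts)
    for sp in rng.sample(pool, cmin):
        out[sp] += 1
    return out

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum(|x_i - y_i|) / sum(x_i + y_i)."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den

# Hypothetical zero-inflated sample: one dominant OTU plus many doubletons
counts = [500] + [2] * 250
rng = random.Random()       # no fixed seed: replicates differ between runs
rep1 = rarefy(counts, 50, rng)
rep2 = rarefy(counts, 50, rng)
print(bray_curtis(rep1, rep2))  # varies run to run; grows as Cmin shrinks
```

Repeating the last three lines many times and taking the min, max, and mean of the Bray-Curtis values is the same idea as the error bars in the figure, just at a toy scale.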
When I first looked at that figure, I said to myself “well then just make sure you have over 100,000 sequences per sample and you’ll be fine even if you rarefy”… but of course you never get perfectly balanced throughput per sample, and I’m pretty sure you illustrate this in Figure 2D of your paper, correct? My takeaway from Figure 2D was that your SRS method avoids the issue of discarding rare variants because of differences in throughput between samples - is that fair?
Correct - it depends on the diversity parameter that you’re looking at (e.g., Shannon div. and species richness behave differently).
Here’s a little thought experiment: imagine you obtained 1,000 sequence counts for a sample. 500 of these counts belong to species A; the other 500 counts belong to 250 different species with 2 counts each. If you now normalise to 500 counts, species A should have 250 counts and all the other species should have 1 count each (this is actually what SRS produces in this scenario). When you use rarefy, however, there is a scenario in which your normalised sample contains only species A. Statistically, this scenario is highly unlikely; however, it still exists and will eventually happen. Furthermore, it is less about the total number of sequence counts and more about the difference among your samples. For example, if your sequencing depth differs by a factor of 2, the chance of a significant bias with rarefy is fairly low. However, if you normalise across orders of magnitude in sequencing depth (which can happen), you are very likely to end up with a bias using rarefy. This was the initial motivation to develop SRS.
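The thought experiment can be sketched in a few lines of Python. This is a simplified sketch of the scaling-with-ranked-subsampling idea, not the SRS R package itself (in particular, real SRS breaks ties in the fractional parts randomly, which this sketch ignores): counts are scaled down to Cmin, the integer parts are kept, and any remaining reads are given to the species with the largest fractional parts.

```python
from math import floor

def srs_sketch(counts, cmin):
    """Simplified scaling with ranked subsampling.
    Scale counts to cmin, keep the integer parts, then distribute the
    remaining reads to species ranked by fractional part (largest first)."""
    total = sum(counts)
    scaled = [c * cmin / total for c in counts]
    out = [floor(s) for s in scaled]
    remainder = cmin - sum(out)
    order = sorted(range(len(counts)),
                   key=lambda i: scaled[i] - out[i], reverse=True)
    for i in order[:remainder]:
        out[i] += 1
    return out

# The thought experiment: species A with 500 counts + 250 doubletons
counts = [500] + [2] * 250
norm = srs_sketch(counts, 500)
print(norm[0], set(norm[1:]))  # → 250 {1}: A is halved, every rare species kept
```

Because the scaling factor here is exactly 0.5, every scaled count is an integer and the result is fully deterministic: no rare species can be lost, which is precisely the contrast with the worst-case rarefy draw described above.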
Unfortunately, we did not analyse the discarding of rare variants, but based on our diversity analysis and on the figure below, SRS on average preserves rare variants better than rarefy. Here’s what we did: we multiplied our sample by 100 (counts × 100; referred to as ‘artificially raised counts’), which of course did not change our alpha diversity indices. We then normalised the sample with artificially raised counts using rarefy and SRS to different Cmin values that were ≥ the counts of the sample without artificially raised counts (Figure S1; see below). Again, everything with 10,000 replications. This time, however, we knew the ‘original alpha diversity’, since it did not change by artificially raising the counts. In other words, we knew which alpha diversity to expect (the dashed black line). To make an extremely long story short, the main idea behind SRS was to mimic a lower sequencing depth.
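The key premise of this experiment, that multiplying all counts by 100 leaves alpha diversity unchanged, follows directly from the fact that indices like Shannon diversity depend only on the relative abundances p_i = c_i / N. A quick stdlib-Python check (toy counts, not the sample from the paper):

```python
from math import log

def shannon(counts):
    """Shannon diversity index H' = -sum(p_i * ln(p_i))."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)

counts = [500] + [2] * 250             # toy sample, not the paper's data
raised = [c * 100 for c in counts]     # 'artificially raised counts'
print(abs(shannon(counts) - shannon(raised)) < 1e-12)  # → True
```

Species richness is scale-invariant for the same reason (multiplying by 100 cannot turn a zero into a nonzero count), which is why the expected alpha diversity (the dashed black line) is known exactly in this setup.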