Hi @devonorourke,
great that you are exploring SRS and thanks for sharing your data with us!
- If I was going to normalize the data via rarefaction, I’d run some sort of rarefaction plot to figure out what the tradeoff would be between losing samples below a minimum read depth, and losing features (ASVs/OTUs) as I increased sampling depths. In the case of SRS, what kind of tradeoff exists? In particular, is there a minimum sampling depth you would recommend not going below?
This question reminds me of a conversation I had earlier with @vheidrich. He modified the 'rarecurve' function from the 'vegan' package for SRS and dubbed it 'SRScurve'. It plots the species richness against the sample size (Cmin). We have now teamed up and will soon release an update of the 'SRS' package that includes a finalised version of his function. We will also extend this function beyond species richness to other alpha-diversity indices such as the Shannon index.
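To give you an idea of what such an alpha-diversity extension could look like, here is a minimal, hedged sketch (this is not the finalised package code; the helper name 'SRSshannon' and the example call are purely illustrative). It simply applies SRS to a single sample at a series of Cmin values and computes the Shannon index on the normalised counts with vegan's diversity():

library(SRS)
library(vegan)

# illustrative only: Shannon diversity of one sample after SRS normalisation
# 'counts' is a one-column data.frame of integer counts for a single sample
SRSshannon <- function(counts, Cmin_values) {
  sapply(Cmin_values, function(Cmin) {
    normalized <- unlist(SRS(data = counts, Cmin = Cmin))  # flatten to a plain count vector
    diversity(normalized, index = "shannon")               # Shannon index via vegan
  })
}

# e.g. SRSshannon(as.data.frame(filt_data_wide[, 1]), Cmin_values = seq(500, 5000, by = 500))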
Here it is done for the first 3 samples of your dataset using @vheidrich's preliminary 'SRScurve' function. We also plotted vegan's 'rarecurve' for comparison (R code at the end of this message; just add it to your script):
You will immediately notice a zigzag behaviour of the SRS curve. This results from the combination of scaling and ranked subsampling of the fractional part (Cfrag). Of course, the first thing I did was validate our algorithm: the zigzag is not caused by an error in the SRS algorithm itself. Currently, we do not know of any mathematically justifiable way to obtain a steady decay in species richness with decreasing Cmin. However, as we have shown in our SRS paper as well as in the figure above, SRS keeps more species than rarefying, because random subsampling without replacement (even if performed repeatedly) discriminates against species with low count numbers. Therefore, I am OK with species counts that do not decay strictly monotonically.
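If you want to inspect this behaviour numerically on your own data (rather than only in the plot), here is a small sketch; the depth range 1000 to 1100 is arbitrary, and 'filt_data_wide' is the table from your script with samples as columns:

library(SRS)

one_sample <- as.data.frame(filt_data_wide[, 1])      # counts of a single sample
Cmin_values <- seq(1000, 1100, by = 1)                # consecutive depths, arbitrary range
richness <- sapply(Cmin_values, function(Cmin) {
  sum(SRS(data = one_sample, Cmin = Cmin) > 0)        # features remaining after SRS
})
data.frame(Cmin = Cmin_values, richness = richness)   # richness does not decay strictly monotonically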
I cannot make recommendations regarding a minimum sampling depth, because it always depends on your data and the goals of your analysis. However, what @vheidrich can tell you from his experience is that your minimum sampling depth using SRS is likely to be lower than a reasonable rarefaction cutoff, and I believe the figure above illustrates this.
- Once I’ve retained the SRS output, I noticed that the row.names were dropped. Those were the featureIDs, and obviously I need those back ! Am I correct in assuming that the rows are not resorted between the input/output data.frames when the SRS function is applied? In other words, can I just add those row.names back in using the original input file to SRS?
Very good point! We did not apply any resorting to the SRS output: you can simply add the original row.names back to the output data frame. We will mention this explicitly in the updated version of our SRS package. Thanks for your input!
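For example (just a sketch with placeholder names; 'feature_table' stands for your original input data.frame and Cmin = 2000 is arbitrary):

library(SRS)

normalized <- SRS(data = feature_table, Cmin = 2000)  # rows keep their original order
rownames(normalized) <- rownames(feature_table)       # restore the featureIDs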
Cheers,
Lukas
library(vegan)
library(SRS)

# number of observed species per sample for a series of SRS sampling depths (Cmin)
SRSnspec <- function(data, Cmin) {
  nspec <- as.data.frame(matrix(data = NA, nrow = ncol(data), ncol = length(Cmin)))
  colnames(nspec) <- paste("N", Cmin, sep = "")
  rownames(nspec) <- colnames(data)
  for (i in seq(1, ncol(data), 1)) {
    j <- 1
    for (sample in Cmin) {
      # count the non-zero features left after SRS normalisation to depth 'sample'
      nspec[i, j] <- length(which(SRS(data = as.data.frame(data[, i]), Cmin = sample) > 0))
      j <- j + 1
    }
  }
  attr(nspec, "Subsample") <- Cmin
  return(nspec)
}
#plot the number of observed species per sequencing depth/sample size
#it works very much like the "rarecurve" function from vegan
SRScurve <- function(x, step = 1, sample, xlab = "Sample Size", ylab = "Species",
                     label = TRUE, col, lty, ...) {
  ## matrix is faster than data.frame
  x <- as.matrix(x)
  ## check input data: must be counts
  if (!identical(all.equal(x, round(x)), TRUE))
    stop("function accepts only integers (counts)")
  ## sort out col and lty
  if (missing(col))
    col <- par("col")
  if (missing(lty))
    lty <- par("lty")
  tot <- rowSums(x)
  S <- specnumber(x)
  ## remove empty rows or we fail
  if (any(S <= 0)) {
    message("empty rows removed")
    x <- x[S > 0, , drop = FALSE]
    tot <- tot[S > 0]
    S <- S[S > 0]
  }
  nr <- nrow(x)
  ## rep col and lty to appropriate length
  col <- rep(col, length.out = nr)
  lty <- rep(lty, length.out = nr)
  ## SRS-normalise each sample at a sequence of depths (Cmin)
  out <- lapply(seq_len(nr), function(i) {
    n <- seq(1, tot[i], by = step)
    if (n[length(n)] != tot[i])
      n <- c(n, tot[i])
    drop(SRSnspec(as.data.frame(x[i, ]), Cmin = n)) # using the function declared above
  })
  Nmax <- sapply(out, function(x) max(attr(x, "Subsample")))
  Smax <- sapply(out, max)
  ## set up plot
  plot(c(1, max(Nmax)), c(1, max(Smax)), xlab = xlab, ylab = ylab,
       type = "n", ...)
  ## rarefied richnesses for given 'sample'
  if (!missing(sample)) {
    abline(v = sample)
    rare <- sapply(out, function(z) approx(x = attr(z, "Subsample"), y = z,
                                           xout = sample, rule = 1)$y)
    abline(h = rare, lwd = 0.5)
  }
  ## rarefaction curves
  for (ln in seq_along(out)) {
    N <- attr(out[[ln]], "Subsample")
    lines(N, out[[ln]], col = col[ln], lty = lty[ln], ...)
  }
  ## label curves at their endpoints
  if (label) {
    ordilabel(cbind(tot, S), labels = rownames(x), ...)
  }
  return(out)
}
# vegan's rarecurve for the first 3 samples (rarecurve expects samples as rows, hence t())
rarecurve(x = t(filt_data_wide[1:3]), step = 1, col = c("#000000", "#E69F00", "#56B4E9"),
          las = 1, label = F, xlim = c(0, 8000))
axis(side = 2, at = seq(0, 130, by = 1), labels = NA, tck = -0.015)
# overlay the SRS-based curves on the same plot
par(new = T)
SRScurve(x = t(filt_data_wide[1:3]), step = 1, col = c("#000000", "#E69F00", "#56B4E9"),
         las = 1, label = F, xlim = c(0, 8000))