I am somewhat confused about the concepts and processes of rarefaction cutoff and rarefying.
My understanding is that a rarefaction cutoff typically refers to a method in which samples with low read counts are excluded from an analysis. That is, the samples are simply excluded, with no further randomization in the downstream analysis.
However, I found some explanations saying that, in other contexts, the term "rarefaction" refers to the process of randomly subsampling a fixed number of reads from each sample to standardize the number of reads across samples, which is very similar to the concept of normalization in my understanding.
Are the terms generally used with different meanings, or does a rarefaction cutoff include further randomization steps after excluding the samples that fall below the read-count threshold? If so, is the main difference between normalization and rarefaction the first step, in which some samples are excluded?
Please help me by comparing normalization, rarefying and rarefaction cutoff ...
Hi @SingeunOh,
I'll describe the way I have it in mind. First of all, to perform any diversity analysis you need to apply a normalisation to your dataset. There are many normalisation methods, among which normalisation by rarefaction is one of the oldest and is still very widely used.
The process of "rarefaction" is exactly the one you describe above. To apply this method, you need to decide on a rarefaction threshold (or, as you call it, a rarefaction cutoff) and subsample this fixed number of sequences from each of your samples. In the end, any sample with a sequence count equal to or higher than the threshold will be included in the final output with exactly as many sequences as your fixed threshold, and any sample with fewer than that will be discarded. The sequences in each sample are chosen randomly, so if you perform this rarefaction step many times you could get slightly different results; also, depending on the implementation of the subsampling process, a drawn sequence could either be put back in the pool (sampling with replacement) or be excluded from the following draws (sampling without replacement).
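The steps above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not how QIIME 2 implements it: the function name `rarefy_table`, the dict-of-counts input format and the `seed` parameter are all assumptions made for the example, and it samples without replacement.

```python
import numpy as np

def rarefy_table(table, depth, seed=None):
    """Rarefy every sample to `depth` reads (hypothetical helper).

    table: dict mapping sample id -> per-feature read counts.
    Samples with fewer than `depth` reads are discarded.
    """
    rng = np.random.default_rng(seed)
    rarefied = {}
    for sample_id, row in table.items():
        row = np.asarray(row, dtype=np.int64)
        if row.sum() < depth:
            continue  # below the rarefaction threshold: sample is dropped
        # One entry per read, labelled with the index of its feature.
        reads = np.repeat(np.arange(row.size), row)
        # Draw `depth` reads at random, without replacement.
        picked = rng.choice(reads, size=depth, replace=False)
        # Count how many of the picked reads belong to each feature.
        rarefied[sample_id] = np.bincount(picked, minlength=row.size)
    return rarefied

table = {"A": [50, 30, 20], "B": [5, 3, 2], "C": [500, 0, 500]}
rarefied = rarefy_table(table, depth=100, seed=0)
```

Here sample B (10 reads) is discarded, while A and C both end up with exactly 100 reads. Running it with a different `seed` can give slightly different counts for C, which is the run-to-run variation mentioned above.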
So, "rarefying" is the normalisation method that you perform by applying a "rarefaction cutoff" to all of your samples.
Thank you, dear Luca. Now I understand that rarefaction is one kind of normalization method and that it includes a random subsampling process. This helps me understand the normalization method a lot better.
May I ask one more question? If the subsampling is done on some samples, how many samples are usually chosen for one iteration? Is there a usual number of samples per iteration? Also, how many times is the sampling repeated for the final calculation? Could you explain the detailed process of rarefaction?
If anybody knows this well, please help me. Thanks.
The normalization works because, in the end, all the normalised samples have the same number of sequences, hence they are meant to be comparable.
The subsampling is performed on every sample with a total read count equal to or higher than the selected threshold.
As for how many times the sequences are subsampled to get the final counts, it depends on the implementation; in the qiime2 diversity pipeline there is an option to specify the number of times you would like the subsampling to be performed.
Hope it helps
Luca
Hi @SingeunOh ,
As a final suggestion, if you would like more information on normalisation methods and their limitations, I suggest looking at:
Lin H, Peddada SD (2020). Analysis of microbial compositions: a review of normalization and differential abundance analysis. npj Biofilms and Microbiomes 6 (1): 60.
Hope it helps
Luca
Stumbling on this post a couple of weeks late but just wanted to share some thoughts as it seemed relevant.
First, a note on the terminology, because this is almost always conflated in the microbiome literature.
Rarefy/rarefied: This is a normalization method that uses random subsampling (usually without replacement) to normalize all samples to a minimum library size N. You also discard any samples that don't have at least N reads. So the result is that all samples have the same number of reads/observations. I have noticed that most of the time in microbiome papers, when folks say "rarefaction" they actually mean rarefy. In QIIME 2 diversity plugins, the "rarefying" or subsampling process is done once, and you set the minimum library size with the --p-sampling-depth parameter.
So, regarding an option to repeat the subsampling a given number of times: this is actually not the case for QIIME 2's core plugins that rarefy, q2-feature-table and q2-diversity. However, you can achieve this repeated rarefying with a few packages in R and with @yxia's QIIME 2 plugin q2-repeat-rarefy. Whether creating a feature table based on the average of repeated rarefying is appropriate or not is debatable in my opinion; both approaches have their own strengths and weaknesses, and the choice may just depend on your question.
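To make the "average of repeated rarefying" idea concrete, here is a minimal NumPy sketch. It is not the q2-repeat-rarefy code or any R package's implementation; the function name `average_repeated_rarefy`, the dict-of-counts input and the `n_iter` and `seed` parameters are hypothetical and only serve the illustration.

```python
import numpy as np

def average_repeated_rarefy(table, depth, n_iter=100, seed=None):
    """Average per-feature counts over `n_iter` independent rarefactions.

    table: dict mapping sample id -> per-feature read counts.
    Samples with fewer than `depth` reads are discarded.
    """
    rng = np.random.default_rng(seed)
    averaged = {}
    for sample_id, row in table.items():
        row = np.asarray(row, dtype=np.int64)
        if row.sum() < depth:
            continue  # sample does not reach the sampling depth
        # Each draw picks `depth` reads without replacement;
        # `draws` has shape (n_iter, n_features).
        draws = rng.multivariate_hypergeometric(row, depth, size=n_iter)
        # The averaged counts are generally not integers.
        averaged[sample_id] = draws.mean(axis=0)
    return averaged

table = {"A": [600, 300, 100], "B": [5, 3, 2]}
averaged = average_repeated_rarefy(table, depth=100, n_iter=100, seed=1)
```

Each row still sums to the sampling depth, but the entries are fractional, which is one of the things to keep in mind if a downstream method expects integer counts.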
Rarefaction: This is actually a technique initially developed to estimate species richness across environments with different sample sizes, and it relies on rarefaction curves. It isn't a normalization method so much as a richness estimator. It is similar to rarefying/subsampling in that it involves randomly drawing n reads from your total reads N (per sample), but the objective here is to repeat this process with increasingly large n until your richness curve reaches an asymptote, at which point that limit is considered your richness. So it is really tied to an abundance curve, which differentiates it from rarefying.
The original author, Howard L. Sanders, described it as follows:
The rarefaction method, instead, is dependent on the shape of the species abundance curve rather than the absolute number of specimens per sample
Rarefaction can be done in QIIME 2 using the alpha-rarefaction and beta-rarefaction actions, where you can choose the range of n and the number of iterations at each step. The outputs of these actions, however, are visualizations; they do not produce a rarefied feature table that you can use downstream.
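For intuition, a rarefaction curve for a single sample can be sketched as below. This is a simplified illustration in plain NumPy, not what the QIIME 2 actions do internally; `rarefaction_curve`, its parameters and the example counts are made up for the purpose, and richness is measured simply as the number of observed features.

```python
import numpy as np

def rarefaction_curve(row, depths, iterations=10, seed=None):
    """Mean observed richness at each subsampling depth n (hypothetical helper).

    row: per-feature read counts for one sample.
    Returns a list of (n, mean number of observed features).
    """
    rng = np.random.default_rng(seed)
    row = np.asarray(row, dtype=np.int64)
    # One entry per read, labelled with the index of its feature.
    reads = np.repeat(np.arange(row.size), row)
    curve = []
    for n in sorted(depths):
        if n > reads.size:
            break  # cannot draw more reads than the sample contains
        richness = [
            np.unique(rng.choice(reads, size=n, replace=False)).size
            for _ in range(iterations)
        ]
        curve.append((n, float(np.mean(richness))))
    return curve

sample = [400, 300, 200, 50, 30, 15, 4, 1]  # 1000 reads, 8 features
curve = rarefaction_curve(sample, depths=[10, 100, 500, 1000], seed=0)
```

Plotting the second element of each pair against n gives the curve; if it flattens out before the largest n, deeper sequencing is unlikely to reveal many more features in that sample.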
Now, to answer your question about how rarefaction cutoff, rarefying, and normalization relate to each other: all three are about making samples with different sequencing depths comparable, but they do so in different ways.
Normalization usually involves adjusting the read counts in each sample based on some factor (e.g. the total number of reads, i.e. the library size) to make the samples comparable. Rarefying involves subsampling a fixed number of reads from each sample to make the sequencing depth consistent across samples. A rarefaction cutoff involves excluding samples with very low read counts from an analysis.
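As a small illustration of the difference between scaling and a pure cutoff, here is a sketch of the two non-random operations. The helper names `filter_by_cutoff` and `total_sum_scale` are hypothetical, and total-sum scaling (dividing by the library size) is used only as one simple example of normalization by a factor.

```python
import numpy as np

def filter_by_cutoff(table, cutoff):
    """Cutoff only: drop samples below `cutoff` reads, leave counts untouched."""
    return {s: np.asarray(r) for s, r in table.items() if np.sum(r) >= cutoff}

def total_sum_scale(table):
    """Scaling: divide each sample's counts by its library size."""
    return {
        s: np.asarray(r, dtype=float) / np.sum(r)
        for s, r in table.items()
        if np.sum(r) > 0  # avoid dividing by zero for empty samples
    }

table = {"A": [50, 30, 20], "B": [5, 3, 2]}
kept = filter_by_cutoff(table, cutoff=100)   # only sample A survives
scaled = total_sum_scale(table)              # A and B get the same proportions
```

Neither of these involves random subsampling: the cutoff removes B entirely, while scaling keeps B and turns both samples into proportions. Rarefying combines the cutoff with a random draw of a fixed number of reads.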