How to select a sampling depth

Hi, may I know whether there is a “minimum” number of reads needed at the sampling step in order to produce a good and reliable result in the core-metrics-phylogenetic analysis? I have a set of 16S data; before running DADA2 my maximum read count is 308,443 and my minimum is 976. After quality filtering, the reads go down to a maximum of 109,977 and a minimum of 78. The five lowest samples after filtering have 15,232; 12,123; 6,202; 791 and 78 reads respectively. Where should I set the sampling depth in this case?


Hi @Clara,

That’s an important question, but unfortunately there isn’t a very clear and easy answer, as it depends on many factors. I made an attempt at highlighting some of those factors on another thread here, which might help you decide. That being said, I think for most common sample types 6,000 reads would be sufficient coverage. What is the sample type being analyzed? I personally would drop the last two samples (791 and 78 reads) and rarefy at 6,202.
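To make the tradeoff concrete, here is a small Python sketch of the choice being discussed. Only the maximum and the five lowest read counts come from the question above; the other counts and all the sample names are made up for illustration.

```python
# Hypothetical per-sample read counts after filtering. S1 and S4-S8 use
# numbers from the question above; S2 and S3 are invented.
read_counts = {
    "S1": 109977, "S2": 54210, "S3": 33102, "S4": 15232,
    "S5": 12123, "S6": 6202, "S7": 791, "S8": 78,
}

def depth_tradeoff(counts, depth):
    """Which samples survive this sampling depth, and what fraction of
    all sequenced reads would remain after rarefying every kept sample."""
    kept = [name for name, n in counts.items() if n >= depth]
    reads_kept = depth * len(kept)
    return kept, reads_kept / sum(counts.values())

for depth in (6202, 12123):
    kept, frac = depth_tradeoff(read_counts, depth)
    print(f"depth={depth}: {len(kept)} samples kept, "
          f"{frac:.0%} of all reads retained")
```

Rarefying at 6,202 keeps every sample except the two smallest, at the cost of discarding most reads from the deepest samples; raising the depth keeps more reads per sample but drops more samples.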


Hi @Mehrbod_Estaki,

Thanks a lot for the reply and the link, which I had never noticed before asking. It is really useful.
The samples I am working with are human microbiome samples. However, I have a query here: if, say, I choose 12,123 as the sampling depth, then (if I am right) all my samples will be subsampled to 12,123 reads, so a sample with 100,000 reads would actually have only ~12% of its total reads analysed. Is it worth doing so? And how do you know 6,000 is sufficient? At the moment I haven’t had a chance to come across papers regarding this; do you have any to recommend?




Hi @Clara,

You’re absolutely right: with the rarefying technique you are tossing away the majority of your hard-earned reads, and in fact that is one of the major cited points against rarefying. However, as discussed in the paper that was linked in the other thread, the choice between rarefying and other normalization methods seems to depend on the data itself. When I suggested that 6,000 is sufficient, that was based mainly on my own experience and what I have seen others use/report. The paper in the other thread compared rarefying at 3,000 reads/sample to other normalization techniques and found it comparable in most cases, so I would say 6,000 reads for human data is likely still sufficient.

This is very much an active topic of research and there are not yet any standard rules and techniques that deal with it across all cases. That being said, some available techniques, such as ANCOM and gneiss, use compositional data instead of raw counts and so circumvent, for the most part, the issue of uneven sampling depth. You would still remove low-count samples in those analyses, though. For other analyses such as alpha and beta diversity, the reads still need to be either normalized or rarefied. Currently only the latter option is available in QIIME 2, but I imagine normalization techniques will arrive here soon enough as well. Hope that helps a bit!
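As a concrete illustration of what rarefying does to a single sample, here is a toy Python sketch of subsampling a feature-count vector without replacement. This is for illustration only, not QIIME 2’s actual implementation, and the counts are made up.

```python
import random
from collections import Counter

def rarefy(counts, depth, seed=0):
    """Randomly subsample per-feature counts down to a fixed depth,
    without replacement (a toy sketch of rarefying, not QIIME 2's code)."""
    if sum(counts) < depth:
        raise ValueError("sample has fewer reads than the requested depth")
    # Expand counts into one entry per read, then keep `depth` random reads.
    reads = [i for i, n in enumerate(counts) for _ in range(n)]
    kept = Counter(random.Random(seed).sample(reads, depth))
    return [kept.get(i, 0) for i in range(len(counts))]

sample = [500, 300, 150, 45, 5]   # made-up feature counts, 1,000 reads total
rarefied = rarefy(sample, depth=100)
print(rarefied, sum(rarefied))    # the rarefied sample sums to exactly 100
```

Every sample ends up with exactly `depth` reads, which is what makes diversity metrics comparable across samples, but rare features (like the 5-read feature here) can drop out entirely.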


Hi @Mehrbod_Estaki

Thanks again. Another thing to confirm: when you said “rarefying” here, if I am correct, you meant the --p-sampling-depth option in the core-metrics analysis, and not the alpha rarefaction plotting?


Correct! Rarefying refers to the act of randomly throwing away reads down to a defined level, which is exactly what the sampling-depth option does. The alpha rarefaction plots are generally used to determine the upper limits of richness/evenness in a sample, and they are one method you can use to choose a ‘sampling depth’ that is sufficient to capture the full richness of the community.
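As a sketch of the idea behind those plots, the toy Python code below (made-up community, not QIIME 2’s implementation) estimates observed richness after rarefying to increasing depths; the depth at which the numbers stop growing is roughly where an alpha rarefaction curve plateaus.

```python
import random

def observed_features(counts, depth, reps=10, seed=1):
    """Mean number of features still observed after rarefying to `depth`:
    the quantity an alpha rarefaction curve plots against sampling depth."""
    rng = random.Random(seed)
    reads = [i for i, n in enumerate(counts) for _ in range(n)]
    richness = [len(set(rng.sample(reads, depth))) for _ in range(reps)]
    return sum(richness) / reps

# Made-up community: a few abundant features plus a long tail of rare ones.
community = [400, 200, 100] + [3] * 100
for depth in (10, 100, 500):
    print(depth, observed_features(community, depth))
```

If richness is still climbing steeply at your chosen depth, the depth is too shallow to capture the community’s full richness.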


Your target sequencing depth depends on what your aims are.
Last time I did 16S on gut microbiome samples, we checked a number of cutoffs after clean-up.
We had three different types of samples, including biopsies and faecal samples.
Our aim was to compare the different sample types, so we set our cutoff quite low, keeping anything over 1,000 reads.
To many this may seem quite low, and I would agree that at such a depth you would exclude some rare species.
However, this was essential to include the biopsy samples we had.
At the genus level we compared other cutoffs at 2,000, 5,000 and 10,000.
From this we had two points that may help you:

1. We found a cutoff of 5,000 was sufficient to avoid a loss of rare species.

2. We also found that even at a cutoff of 1,000 we lost less than 1% of the community in the samples that had over 5,000 reads, meaning we could include 90% of our samples and still get an adequate representation of the genera from each sample.
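One way to quantify “losing X% of the community”, as described in point 2 above, is the relative abundance of features that disappear after rarefying. Here is a toy Python sketch with made-up genus counts; it is not the analysis actually used in that study.

```python
import random
from collections import Counter

def fraction_lost(counts, depth, seed=2):
    """Relative abundance of features that vanish when the sample is
    rarefied to `depth`: one reading of 'losing X% of the community'."""
    reads = [i for i, n in enumerate(counts) for _ in range(n)]
    kept = Counter(random.Random(seed).sample(reads, depth))
    lost = sum(n for i, n in enumerate(counts)
               if n > 0 and kept.get(i, 0) == 0)
    return lost / sum(counts)

# Made-up genus-level sample: 6,000 reads with a tail of 20 rare genera.
sample = [3000, 1500, 800, 400, 200] + [5] * 20
print(f"lost at a 1,000-read depth: {fraction_lost(sample, 1000):.1%}")
```

Because the genera that drop out are by definition the rarest ones, the lost fraction of total abundance can stay very small even when a fair number of individual rare genera disappear.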

@Christopher_Poulton, thanks for sharing, very useful :grinning:


An off-topic reply has been split into a new topic: Selecting and visualizing alpha/beta diversity analyses

Please keep replies on-topic in the future.

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.