Sufficient number of reads in a sample for an analysis


I would like to discuss when to drop a sample because of a small number of reads and when to keep it.

I don’t know why, but sometimes I get very different numbers of reads within a run: X has 40k or even 10k, whereas Y has 250k.

I need a perspective about when to re-run a sample or just completely drop it.

Is there anybody who can lead me to a good point of view?

Hi @the_dummy,

I drop samples with fewer than 1000-2500 sequences in most of my sample types, unless there’s a clear cutoff in my data (a jump of 1000 sequences with a low number of samples). Then, I rarefy to this depth and work from there. In my experience/opinion, deeper sequencing doesn’t buy you much more than noise in most datasets with more than 10K reads.
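For anyone unfamiliar with what "rarefy to this depth" means in practice, here is a minimal sketch in plain Python. It assumes each sample is a dictionary of feature counts; the `rarefy` function and the sample and taxon names are made up for illustration and are not any particular tool's API.

```python
import random
from collections import Counter

def rarefy(counts, depth, seed=None):
    """Subsample `depth` reads without replacement from a {feature: count} dict.

    Returns None when the sample has fewer than `depth` reads,
    which corresponds to dropping that sample.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    rng = random.Random(seed)
    # Expand to one entry per read, then draw without replacement.
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    return dict(Counter(rng.sample(reads, depth)))

# Hypothetical samples, for illustration only.
samples = {
    "S1": {"taxonA": 1500, "taxonB": 900, "taxonC": 100},  # 2500 reads
    "S2": {"taxonA": 300, "taxonB": 200},                  # 500 reads
}
rarefied = {name: rarefy(c, 1000, seed=42) for name, c in samples.items()}
```

With a depth of 1000, `S1` is subsampled down to exactly 1000 reads and `S2` comes back as `None` because it falls below the cutoff.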

In terms of keep vs re-run: it depends. I think if they’re really special samples and you have to re-run, you should re-run everything to make sure it’s comparable. And maybe consider a re-extraction?
But I also just try to build extra space into my studies. On fecal samples, I like to assume a 5-10% failure rate (10% on smaller studies because those invariably have more issues); maybe 10-15% in oral, and 30-50% in skin/vaginal, but these are back-of-the-hand “failure” rates.

You may also just find some samples are low biomass.
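One way to turn an assumed failure rate into a sample budget is sketched below. The formula (divide the number of usable samples you need by the expected success rate and round up) is my own assumption about how to apply these rates, and `samples_to_collect` is a hypothetical helper.

```python
import math

def samples_to_collect(n_needed, failure_rate):
    """Samples to collect so that about `n_needed` survive an expected failure rate."""
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in [0, 1)")
    return math.ceil(n_needed / (1 - failure_rate))

# Target of 100 usable samples under different assumed failure rates.
fecal = samples_to_collect(100, 0.10)  # 112
skin = samples_to_collect(100, 0.50)   # 200
```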

…All said, I’m sure @colinbrislawn and several others have excellent opinions and diversity is once again a strength in the microbiome. (Except that vaginal microbiome.)



Good to see you again!

There is no single rule about how best to do this. I try to pick a number as high as possible that does not throw away all my samples from a cohort in my study.

Sample reads
TreatmentA1 34859
TreatmentA2 48595
TreatmentA3 28495
TreatmentA4 76937
TreatmentA5 76284
TreatmentA6 97634
TreatmentB1 5868
TreatmentB2 1759
TreatmentB3 9847
TreatmentB4 2846
TreatmentB5 3857
TreatmentB6 3776

So I would love to set my minimum to 20k reads… but then I lose all my TreatmentB samples! :scream_cat:

Here, I would probably set my minimum to 3K reads; that way I only lose two samples and still have samples in both of my treatment cohorts.
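A quick way to compare candidate cutoffs is to count how many samples each cohort keeps at each one. This is only a sketch using the example read counts from the table above; `retained_per_group` is a hypothetical helper, and it assumes the cohort name is the sample ID minus its trailing digit.

```python
reads = {
    "TreatmentA1": 34859, "TreatmentA2": 48595, "TreatmentA3": 28495,
    "TreatmentA4": 76937, "TreatmentA5": 76284, "TreatmentA6": 97634,
    "TreatmentB1": 5868, "TreatmentB2": 1759, "TreatmentB3": 9847,
    "TreatmentB4": 2846, "TreatmentB5": 3857, "TreatmentB6": 3776,
}

def retained_per_group(reads, min_depth):
    """Count samples per cohort with at least `min_depth` reads."""
    kept = {}
    for sample, n in reads.items():
        group = sample[:-1]  # "TreatmentA1" -> "TreatmentA"
        kept.setdefault(group, 0)
        if n >= min_depth:
            kept[group] += 1
    return kept

for depth in (1000, 3000, 5000, 20000):
    print(depth, retained_per_group(reads, depth))
```

At 3000 reads this keeps 6 TreatmentA and 4 TreatmentB samples; at 20000 it keeps 6 and 0.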

Or keep them so that I have equal groups in my test. :man_shrugging:


There have been times when I re-ran some samples and got some improvement, e.g. 3 taxa in the first run and ~30 taxa in the second run.

And there have been times when I dropped a sample because the low-read-count sample was in a group of about 6 samples, so I had other good samples to continue with.

@colinbrislawn, are the samples of A and B from different runs? If not, why is there such a drop in read counts?

Thank you @jwdebelius, @colinbrislawn, I got the answer for what to do when this happens. But I would like to point out the real problem, even if it is a problem with the platform.

I believe this is about the platform because when I just re-run without changing anything or re-extracting, the read count can jump from 2k to 100k. Plus, Phred scores are getting worse and worse with each run.

I’m poking at this so much because I don’t own the platform. If the platform is the cause, I would conclude that they are not taking good care of it, and plan to change where I send the cartridge.

You can have a wide range on a sequencing run. Keep in mind that there are multiple stochastic processes in 16S rRNA sequencing that affect read depth. Extraction efficiency is one, but you also have PCR efficiency and flow cell adherence efficiency. These can all result in varying read depths for multiple samples in the same run. Re-extraction may save some of the samples, but some just have lower counts.

I also want to mention that to me, 10K is not “low”. I’m less comfortable around 1K (although that’s my absolute minimum) and more comfortable around 5K, but in most of the study sizes I’m working with, increasing read depth to 10K or 100K just increases the number of taxa I can’t test. It is a little bit weird to me that @colinbrislawn has such big gaps by treatment (unless these are spoofed?), but that may be a secondary sample issue! If you’re looking for novelty, deeper sequencing might be a solution for you, but from an “I want to do statistical analysis” perspective, deeper is not always better.



This is fake data I made up as an example! Fully spoofed!
(And there does appear to be a big difference between my mock cohorts… :thinking: )

I also agree that 10K reads per sample is very good. For comparison, the Earth Microbiome Project published in Nature while rarefying to only 5k reads per sample. :woman_shrugging:



You can also use objective methods to determine a minimum acceptable read depth. The alpha and beta rarefaction methods allow you to determine the effects of rarefaction on alpha and beta diversity, which can be used as guides for your analysis. Minimum acceptable depth will depend on the diversity present in a sample, so (compared to using a rule-of-thumb approach) these methods can allow you to select acceptably lower read depths in low diversity samples and prevent you from going too low in high diversity samples.
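To show the idea behind alpha rarefaction, here is a minimal plain-Python sketch that repeatedly subsamples one sample at several depths and averages the number of observed features. It is not the QIIME 2 implementation; `observed_features_curve` and the example counts are made up for illustration.

```python
import random

def observed_features_curve(counts, depths, iterations=10, seed=0):
    """Mean number of observed features at each depth (None if depth > total reads)."""
    rng = random.Random(seed)
    # One entry per read so we can subsample without replacement.
    reads = [f for f, n in counts.items() for _ in range(n)]
    curve = {}
    for depth in depths:
        if depth > len(reads):
            curve[depth] = None  # the sample would be dropped at this depth
            continue
        richness = [len(set(rng.sample(reads, depth))) for _ in range(iterations)]
        curve[depth] = sum(richness) / iterations
    return curve

# Hypothetical sample: two dominant taxa plus 50 rare ones (7500 reads total).
sample = {"dominant1": 4000, "dominant2": 3000}
sample.update({f"rare{i}": 10 for i in range(50)})

curve = observed_features_curve(sample, depths=[100, 500, 1000, 5000, 10000])
for depth, richness in curve.items():
    print(depth, richness)
```

The depth where the curve flattens is roughly where extra reads stop revealing new features, so a cutoff near that plateau loses little diversity information.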


I laughed hard at this since you took your time to make those exact numbers :rofl:

Thanks to all of you, my mind is clear now.