Sufficient number of reads in a sample for an analysis

the_dummy · January 17, 2020, 10:31am

Hello,

I would like to discuss when to drop a sample because of a small number of reads, or to continue.

I don't know why, but sometimes I get very different numbers of reads within a run: sample X has 40k or even 10k reads, whereas sample Y has 250k.

I need a perspective about when to re-run a sample or just completely drop it.

Is there anybody who can lead me to a good point of view?

jwdebelius · January 17, 2020, 2:59pm

Hi @the_dummy,

I drop samples with fewer than 1000-2500 sequences per sample in most of my sample types, unless there's a clear cutoff in my data (a jump of 1000 sequences with a low number of samples). Then, I rarefy to this depth and work from there. In my experience/opinion, deeper sequencing doesn't buy you much more than noise in most datasets with more than 10K reads.
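As a rough illustration of the drop-then-rarefy step described above, here is a minimal Python sketch. It is not the QIIME 2 implementation; the function name `rarefy`, the `{taxon: count}` input format, and the toy sample are assumptions made up for this example.

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample a {taxon: count} dict to exactly `depth` reads.

    Reads are drawn without replacement. Returns None when the sample
    has fewer than `depth` reads, i.e. the sample would be dropped.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    rng = random.Random(seed)
    # Expand to one label per read, then draw `depth` of them.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    rarefied = {}
    for taxon in rng.sample(pool, depth):
        rarefied[taxon] = rarefied.get(taxon, 0) + 1
    return rarefied

# Hypothetical sample with 1000 reads in total.
sample = {"taxon_a": 700, "taxon_b": 250, "taxon_c": 50}
print(rarefy(sample, 500))   # a dict whose counts sum to 500
print(rarefy(sample, 5000))  # None: too shallow, sample is dropped
```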

In terms of keep vs re-run: it depends. I think if they're really special samples and you have to re-run, you should re-run everything to make sure it's comparable. And maybe consider a re-extraction?
But I also just try to build extra space into my studies. On fecal samples, I like to assume a 5-10% failure rate (10% on smaller studies because those invariably have more issues); maybe 10-15% in oral, and 30-50% in skin/vaginal, but these are back-of-the-hand "failure" rates.

You may also just find some samples are low biomass.

...All said, I'm sure @colinbrislawn and several others have excellent opinions, and diversity is once again a strength in the microbiome. (Except in the vaginal microbiome.)

Best,
Justine

colinbrislawn · January 17, 2020, 2:59pm

Good to see you again!

There is no single rule about how best to do this. I try to pick a number as high as possible without making my cohorts too unbalanced.

Sample	reads
TreatmentA1	34859
TreatmentA2	48595
TreatmentA3	28495
TreatmentA4	76937
TreatmentA5	76284
TreatmentA6	97634
TreatmentB1	5868
TreatmentB2	1759
TreatmentB3	9847
TreatmentB4	2846
TreatmentB5	3857
TreatmentB6	3776

So I would love to set my minimum to 20k reads... but then I lose all my TreatmentB samples!

Here, I might set my minimum to 3K reads; that way I only lose two samples and still have n=6 and n=4 in my two cohorts.

Or keep all of the samples so that I have equal groups in my test.
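To make the trade-off concrete, here is a small Python sketch that counts how many samples each cohort keeps at a few candidate minimum depths, using the example table above. The helper names (`group_of`, `retained_per_group`) and the convention that a sample's group is its ID minus the trailing digits are assumptions for illustration only.

```python
# Read counts copied from the example table above.
reads = {
    "TreatmentA1": 34859, "TreatmentA2": 48595, "TreatmentA3": 28495,
    "TreatmentA4": 76937, "TreatmentA5": 76284, "TreatmentA6": 97634,
    "TreatmentB1": 5868, "TreatmentB2": 1759, "TreatmentB3": 9847,
    "TreatmentB4": 2846, "TreatmentB5": 3857, "TreatmentB6": 3776,
}

def group_of(sample_id):
    # Assumed naming convention: group name followed by a replicate number.
    return sample_id.rstrip("0123456789")

def retained_per_group(reads, min_depth):
    """Count samples per group with at least `min_depth` reads."""
    kept = {}
    for sample_id, n_reads in reads.items():
        group = group_of(sample_id)
        kept.setdefault(group, 0)  # keep groups visible even at zero
        if n_reads >= min_depth:
            kept[group] += 1
    return kept

for depth in (1000, 3000, 20000):
    print(depth, retained_per_group(reads, depth))
# 1000 {'TreatmentA': 6, 'TreatmentB': 6}
# 3000 {'TreatmentA': 6, 'TreatmentB': 4}
# 20000 {'TreatmentA': 6, 'TreatmentB': 0}
```

With these numbers, a 3K minimum keeps 6 and 4 samples, while a 20K minimum removes every TreatmentB sample.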

the_dummy · January 21, 2020, 6:36am

There are times when I have re-run some samples and got some improvement, for example 3 taxa in the first run and ~30 taxa in the second run.

And there are times when I dropped the sample because the low-read-count sample was in a group of about 6 samples, so I had other good samples to continue with.

@colinbrislawn, are the samples of A and B from different runs? If not, why is there such a drop in read counts?

Thank you @jwdebelius, @colinbrislawn, I got the answer for what to do when this happens. But I would like to point out the real problem, even if it is a problem with the platform.

I believe this is about the platform because when I just re-run without changing anything or re-extracting, the read count can jump from 2k to 100k. Plus, Phred scores are getting worse and worse with each run.

I'm poking at this so much because I don't have the platform myself. If the reason is the platform, I would think that they are not taking good care of it, and I would plan to change where I send the cartridge.

jwdebelius · January 21, 2020, 8:55am

You can have a wide range on a sequencing run. Keep in mind that there are multiple stochastic processes in 16S rRNA sequencing that affect read depth. Extraction efficiency is one, but you also have PCR efficiency and flow cell adherence efficiency. These can all result in varying read depths for multiple samples in the same run. Re-extraction may save some of the samples, but some just have lower counts.

I also want to mention that, to me, 10K is not "low". I'm less comfortable around 1K (although that's my absolute minimum) and more comfortable around 5K, but in most of the study sizes I'm working with, increasing read depth to 10K or 100K just increases the number of taxa I can't test. It is a little bit weird to me that @colinbrislawn has such big gaps by treatment (unless these are spoofed?), but that may be a secondary sample issue! If you're looking for novelty, deeper sequencing might be a solution for you, but from an "I want to do statistical analysis" perspective, deeper is not always better.

Best,
Justine

colinbrislawn · January 22, 2020, 4:05pm

This is fake data I made up as an example! Fully spoofed!
(Is it still called a 'batch effect' if the batches are fake?)

I also agree that 10K reads per sample is very good. For comparison, the Earth Microbiome Project published in Nature while rarefying to only 5k reads per sample.
https://www.nature.com/articles/nature24621#Sec3

Nicholas_Bokulich · January 22, 2020, 5:19pm

You can also use objective methods to determine a minimum acceptable read depth. The alpha and beta rarefaction methods allow you to determine the effects of rarefaction on alpha and beta diversity, which can be used as guides for your analysis. Minimum acceptable depth will depend on the diversity present in a sample, so (compared to using a rule-of-thumb approach) these methods can allow you to select acceptably lower read depths in low diversity samples and prevent you from going too low in high diversity samples.
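The idea behind alpha rarefaction can be sketched in a few lines of Python: subsample one sample at increasing depths and see where the number of observed taxa levels off. This is only a toy illustration and not the QIIME 2 implementation; the function name `observed_taxa_curve`, the iteration count, and the example sample are assumptions invented here.

```python
import random

def observed_taxa_curve(counts, depths, iterations=10, seed=0):
    """Mean number of observed taxa at each subsampling depth.

    `counts` is a {taxon: read_count} dict for a single sample.
    Depths larger than the sample's total reads map to None.
    """
    rng = random.Random(seed)
    # One label per read, so sampling is without replacement.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    curve = {}
    for depth in depths:
        if depth > len(pool):
            curve[depth] = None  # sample is too shallow for this depth
            continue
        richness = [len(set(rng.sample(pool, depth)))
                    for _ in range(iterations)]
        curve[depth] = sum(richness) / iterations
    return curve

# Hypothetical low-diversity sample with 1000 reads in total.
sample = {"taxon_a": 600, "taxon_b": 300, "taxon_c": 90,
          "taxon_d": 9, "taxon_e": 1}
curve = observed_taxa_curve(sample, depths=[10, 100, 500, 1000, 2000])
for depth, mean_taxa in curve.items():
    print(depth, mean_taxa)
```

If the curve is already flat well below a candidate depth, going lower probably costs little for that sample; if it is still climbing, the depth is likely too shallow for that level of diversity.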

the_dummy · January 23, 2020, 6:16am

I laughed hard at this since you took your time to make those exact numbers

Thanks to all of you, my mind is clear now.