Sequence Counts/Library Sizes - Deciding what threshold to filter out

Ellenphant · June 24, 2020, 2:51pm

Probably another question that has no true answer but is something that I am finding very interesting to think about.

Are there common gold-standard rules for which sequence counts to filter out (e.g. if a feature has fewer than 25 reads, remove it) and which library sizes to keep (e.g. if a sample has fewer than 10,000 reads, we don't want to look at it)?

I imagine the library size threshold is likely dependent on sample characteristics and on whether a sample with 5,000 sequences can capture the same level of information as a sample with 50,000 sequences.

But for the individual sequence counts, things get muddier in my opinion, because aren't we really just making a decision based on a value that we deem appropriate? If you are filtering out everything below 25 reads, are the ASVs with 26 and 27 reads really that much better to keep? And even for singletons: we often remove singletons during denoising (to my understanding) but keep variants with counts of 2 (unless you do other sequence count filtering). Does that one extra read really make the variants with a count of 2 that much better than those with a count of 1?

Just interested to hear what other people think!

ChrisKeefe · June 24, 2020, 5:08pm

@Ellenphant, as you suggest, thresholds will vary greatly depending on the study design and the data. High-biomass (e.g. fecal) samples will likely be filtered at much higher sequencing depths than low-biomass (e.g. desert soil) samples.

I'm sure someone else here has more experience in the literature on this, but here are a couple of papers (Auer 2017, Bokulich 2013) I dug up last week that discuss filtering approaches, and in the case of Bokulich, a suggestion for how that threshold might be chosen.
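
To make the trade-off a bit more concrete, here is a minimal Python sketch of the two filters being discussed. Everything in it is an assumption for illustration: the toy table, the function names `filter_table` and `sweep`, the cutoffs, and the choice to filter features before samples. It is not the method from either paper, just a way to see how sensitive the result is to the threshold you pick.

```python
# Hypothetical toy feature table: keys are ASVs, inner keys are samples.
table = {
    "ASV_1": {"S1": 9000, "S2": 4000, "S3": 30},
    "ASV_2": {"S1": 1500, "S2": 900, "S3": 10},
    "ASV_3": {"S1": 20, "S2": 4, "S3": 0},
    "ASV_4": {"S1": 1, "S2": 1, "S3": 0},
}


def filter_table(table, min_feature_count, min_library_size):
    """Drop low-count features, then drop shallow samples.

    The order (features first, samples second) is an assumption;
    reversing it can change which samples survive.
    """
    # Keep features whose total count across all samples meets the cutoff.
    kept = {
        feature: counts
        for feature, counts in table.items()
        if sum(counts.values()) >= min_feature_count
    }
    # Library size per sample, computed on the remaining features.
    samples = {s for counts in kept.values() for s in counts}
    library = {
        s: sum(counts.get(s, 0) for counts in kept.values()) for s in samples
    }
    deep_enough = {s for s, n in library.items() if n >= min_library_size}
    return {
        feature: {s: n for s, n in counts.items() if s in deep_enough}
        for feature, counts in kept.items()
    }


def sweep(table, cutoffs):
    """For each feature cutoff, report features kept and fraction of reads kept."""
    totals = {feature: sum(counts.values()) for feature, counts in table.items()}
    all_reads = sum(totals.values())
    rows = []
    for cutoff in cutoffs:
        kept_totals = [t for t in totals.values() if t >= cutoff]
        rows.append((cutoff, len(kept_totals), sum(kept_totals) / all_reads))
    return rows


filtered = filter_table(table, min_feature_count=25, min_library_size=1000)
for cutoff, n_features, read_fraction in sweep(table, [1, 3, 25]):
    print(cutoff, n_features, round(read_fraction, 4))
```

In this toy table, raising the feature cutoff from 1 to 25 removes half of the ASVs but only a tiny fraction of the reads, which is one way to check how much a particular threshold actually changes your data before committing to it.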

Chris