Why shouldn't I just filter based on quality score and throw away those with low scores?


Using dada2 or deblur is essentially taking out the low-quality data. Is there a reason I should use one of these and not just filter based on the quality score? Thanks.

This is a bit of an oversimplification! E.g., dada2 does not only remove low-quality reads; it also aims to correct the errors in reads to recover the true signal, rather than throwing away or trimming a read wherever an error is observed.

Both dada2 and deblur actually use a Q-score-based filter as the first step (dada2 does this automatically; deblur should be preceded by a `qiime quality-filter` step to perform this filtering). This is only a rough filtering step, though, and does not catch (let alone correct!) all errors.

The best evidence for this is really in the literature; I’d advise you to look at the dada2 and deblur papers and other benchmarks that compare these methods vs. QIIME 1 (where the only option was rigorous Q-score-based filtering, essentially step 1 in the dada2/deblur workflows).

Q-score-based filtering is an imprecise tool: either errors creep into your data, or you trim so aggressively that you are left with almost nothing — see here for a benchmark of Q-score-based filtering to give you an idea that a rough filter alone is not enough.
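To see the trade-off concretely, here is a toy sketch (my own illustration, not actual QIIME or dada2 code) of the classic QIIME 1-style strategy: truncate each read at the first base whose Phred quality drops below a threshold, then discard reads that end up too short. The function names and quality values are made up for the example.

```python
# Toy illustration of the Q-score filtering trade-off (hypothetical code,
# not from any real pipeline).

def truncate_at_quality(qualities, min_q):
    """Return the read length kept after truncating at the first base with Q < min_q."""
    for i, q in enumerate(qualities):
        if q < min_q:
            return i
    return len(qualities)

def qscore_filter(reads, min_q, min_len):
    """Keep (truncated) reads that remain at least min_len bases long."""
    kept = []
    for quals in reads:
        n = truncate_at_quality(quals, min_q)
        if n >= min_len:
            kept.append(n)
    return kept

# A read whose quality degrades toward the 3' end, as is typical for Illumina:
read = [38] * 100 + [30] * 50 + [18] * 100  # 250 bp total

lax = qscore_filter([read], min_q=15, min_len=100)     # keeps all 250 bp, errors included
strict = qscore_filter([read], min_q=35, min_len=200)  # discards the read entirely
print(lax, strict)  # → [250] []
```

With a lax threshold the whole read survives, errors and all; with a strict one the read is thrown away entirely, and there is no middle setting that recovers the true sequence — which is exactly the gap the denoisers' error models fill.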

dada2 and deblur == a fine cup of espresso :coffee:
q-score filtering == straining the grinds through your teeth :cowboy_hat_face:


I’ve never seen such a beautiful metaphor in my life. Wow, thank you for that amazing explanation, and also for that image of coffee grounds in my mouth, which I’ll never ever be able to forget.