I processed my 16S rRNA amplicon sequencing data in QIIME 2 (using Deblur).
The pipeline was: raw input reads → joining → trimming → Deblur denoising → chimera filtering.
In the end, only about 25% of the raw reads survive as non-chimeric reads, for an overall raw-to-final retention of ~25%.
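For concreteness, here is how I tally per-step and overall retention from the read counts each stage reports (the counts below are made up purely to illustrate the arithmetic, not taken from my actual run):

```python
# Hypothetical read counts after each pipeline stage (not real data)
counts = [
    ("raw", 100_000),
    ("joined", 72_000),
    ("trimmed", 68_000),
    ("deblur", 34_000),
    ("non-chimeric", 25_000),
]

raw = counts[0][1]
prev = raw
for stage, n in counts:
    # step: retention vs the previous stage; overall: retention vs raw input
    print(f"{stage:12s} {n:8,d}  "
          f"step: {100 * n / prev:5.1f}%  overall: {100 * n / raw:5.1f}%")
    prev = n
```

Looking at the per-step column (rather than only the overall number) is what tells you which stage is actually eating the reads.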
From what I’ve read, typical expectations are:
~50–90% retention after joining (depending on read length/quality),
additional loss during denoising (Deblur often removes 30–50% of reads),
~10–30% removed during chimera filtering.
Taken together, final retention of ~40–70% is often considered "healthy", although values as low as ~25–30% are sometimes reported, especially with lower-quality data or strict trimming parameters.
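Multiplying those per-step ranges together shows how wide the plausible window really is. A quick back-of-the-envelope check, using the fraction of reads kept at each step:

```python
# Fraction of reads *kept* at each step, from the ranges above
joining = (0.50, 0.90)    # 50–90% retained after joining
denoising = (0.50, 0.70)  # Deblur removes 30–50%, so 50–70% kept
chimera = (0.70, 0.90)    # 10–30% removed, so 70–90% kept

lo = joining[0] * denoising[0] * chimera[0]
hi = joining[1] * denoising[1] * chimera[1]
print(f"compounded final retention: {lo:.1%} to {hi:.1%}")
```

Note that the low end of this product is well under 40%, which is one reason values around 25–30% do get reported for tougher samples.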
My questions:
Is ~25% final retention still acceptable for downstream diversity analyses?
What retention percentage is generally considered a reasonable minimum in practice (e.g. ≥30% or ≥40%)?
Any advice or shared experience would be greatly appreciated!
With high-biomass samples, where plenty of DNA is extracted and PCR yields are strong, keeping 70% or 80% of reads is achievable.
But when biomass drops, less DNA is extracted, more PCR cycles are needed, and more chimeras are formed.
I would argue that it's better to have fewer, high-quality reads than more reads of worse quality. Even though it drags total retention down, I want chimeras to be removed!
It's a tradeoff between quality and quantity, which depends on the biological context.
In addition to percent of reads that pass filter, consider the count of reads that pass filter. For example, if I'm using NextSeq, I might have 200k reads per sample, and keeping only 40% still results in 80k reads per sample.
Is 80k reads per sample enough?
Well, you can run a power analysis if you know what power you need to answer a specific question.
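As a rough sketch of what such a power analysis might look like: the snippet below uses the normal approximation to a two-sided, two-sample t-test to estimate per-group sample size (note this addresses how many samples you need, not read depth directly; the effect size and thresholds are assumptions chosen for illustration):

```python
import math
from statistics import NormalDist

def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided, two-sample comparison,
    using the normal approximation: n = 2 * ((z_{1-a/2} + z_power) / d)^2."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_b = NormalDist().inv_cdf(power)          # quantile for desired power
    return math.ceil(2 * ((z_a + z_b) / effect_size) ** 2)

# To detect a large standardized difference (Cohen's d = 0.8) in,
# say, a diversity metric between two groups:
print(samples_per_group(0.8))
```

Depth sufficiency itself is usually checked separately, e.g. with rarefaction curves, to see whether observed diversity has plateaued at your per-sample read count.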
This Nature paper used "an average of 80,000 sequences per sample" and they got published!
I find that reviewer 3 rarely bothers me about read depth.