Is ~25% final non-chimera read retention acceptable, and what is the expected range?

Hi everyone,

I processed my 16S rRNA amplicon sequencing data in QIIME 2 (using Deblur).
The pipeline was: raw input reads → joining → trimming → Deblur denoising → chimera filtering.

In the end, only about 25% of the raw reads remain as non-chimera reads.
In other words, the overall retention rate from raw to final non-chimera reads is ~25%.

From what I’ve read, typical expectations are:

  • ~50–90% retention after joining (depending on read length/quality),
  • additional loss during denoising (Deblur often removes 30–50% of reads),
  • ~10–30% removed during chimera filtering.

That suggests that 40–70% final retention is often considered “healthy”, although values as low as ~25–30% are sometimes reported, especially with lower quality data or strict trimming parameters.
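As a rough sanity check, the per-step figures listed above can be multiplied together to get a cumulative range. This is only a sketch: it assumes each percentage applies to the reads entering that step, that the steps are independent, and it ignores any loss at the trimming step (no figure is quoted for it). The names `steps` and `cumulative_retention` are just illustrative.

```python
# Fraction of reads KEPT at each step, derived from the ranges quoted above.
steps = {
    "joining": (0.50, 0.90),            # 50-90% retained
    "deblur": (0.50, 0.70),             # 30-50% removed -> 50-70% kept
    "chimera_filtering": (0.70, 0.90),  # 10-30% removed -> 70-90% kept
}


def cumulative_retention(steps):
    """Multiply the per-step (low, high) kept fractions into an overall range."""
    low = high = 1.0
    for step_low, step_high in steps.values():
        low *= step_low
        high *= step_high
    return low, high


low, high = cumulative_retention(steps)
print(f"Cumulative retention: {low:.1%} to {high:.1%}")
```

Under those assumptions the compounded range comes out at roughly 18% to 57%, so ~25% would sit toward the low end of it rather than outside it.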

My questions:

  1. Is ~25% final retention still acceptable for downstream diversity analyses?
  2. What retention percentage is generally considered a reasonable minimum in practice (e.g. ≥30% or ≥40%)?

Any advice or shared experience would be greatly appreciated!

Hello @gy.park,

Welcome to the forums! :qiime2:

As is often the case, it depends ™

With high-biomass samples that yield plenty of extracted DNA and plenty of PCR product, keeping 70% or 80% of reads is possible.

But when biomass drops, less DNA is extracted, more PCR cycles are needed, and more chimeras are formed.

I would argue that it's better to have fewer high-quality reads than more reads of worse quality. Even though it makes total retention go down, I want chimeras to be removed!

It's a tradeoff between quality and quantity, which depends on the biological context. :balance_scale:

In addition to percent of reads that pass filter, consider the count of reads that pass filter. For example, if I'm using NextSeq, I might have 200k reads per sample, and keeping only 40% still results in 80k reads per sample.
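To make that concrete, here is a minimal sketch of the arithmetic. The 200k starting depth is the NextSeq example from above; applying the 25% retention from the original question to that same depth is purely hypothetical, since the actual raw read count per sample isn't given. `reads_passing` is an illustrative helper, not part of QIIME 2.

```python
def reads_passing(raw_reads, retention):
    """Number of reads left per sample after filtering, given a retained fraction."""
    return round(raw_reads * retention)


raw_per_sample = 200_000  # example NextSeq depth from the post above

# 40% retention, as in the example above
print(reads_passing(raw_per_sample, 0.40))

# 25% retention, as in the original question (assuming the same raw depth)
print(reads_passing(raw_per_sample, 0.25))
```

The same retention percentage can mean very different final depths depending on how many raw reads each sample started with, which is why the absolute count is worth checking alongside the percentage.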

Is 80k reads per sample enough? :thinking:

  • Well, you can run a power analysis if you know what power you need to answer a specific question.
  • This Nature paper used "an average of 80,000 sequences per sample" and they got published!

I find that reviewer 3 rarely bothers me about read depth :person_shrugging:
