Why is non-chimera read count often referred to as sequencing depth in 16S rRNA analysis?

Hello,

I have a question regarding the definition of sequencing depth in 16S rRNA amplicon sequencing workflows.

In whole-genome or shotgun metagenomics, sequencing depth (coverage) is usually defined as:

Depth = Total bases sequenced / Genome size (reference)

However, in many 16S rRNA studies, sequencing depth is instead described simply as the number of usable reads per sample (often after quality filtering and chimera removal, i.e., non-chimera reads).

I understand this is a practical convention, since all reads are usually trimmed to a uniform length (e.g., ~400 bp for V3–V4), so “number of reads” is proportional to total bases. But conceptually, a single read is not equivalent to a species or a genome, and in fecal samples with many species, each read is just a short fragment of the 16S region.

So my questions are:

  1. Why is non-chimera read count typically referred to as sequencing depth in 16S analysis?

  2. Is this simply a convention due to uniform read length, or is there a more formal reasoning behind this terminology?

  3. Are there recommended references (guidelines, QIIME/Mothur docs, or papers) that explicitly define sequencing depth this way for 16S?

Thank you very much!

Hello again, @gy.park,

This is a very good question because it highlights what makes amplicons different from genomics, RNA-seq, or other untargeted sequencing methods.

Once we add this missing detail and focus on it, we can answer the other questions!

each read is just a short fragment of the 16S region...

...that has been amplified with PCR!
Amplicon: From ampli(fy) + -icon (as in replicon). (Source: Wiktionary)

:dna: :repeat_single_button: :sparkles:

And this answers your first question:

  1. Why not include chimeric reads in the count? Chimeric reads are a non-biological artifact of the PCR process, so we don't want to count them!

PCR amplification is an essential step in the wet lab, and it directly changes what our reads look like in the dry lab!

I don't do this...

I often work with 16S V4 amplicons, from paired-end Illumina sequencing.

Depending on read quality, I might trim them to 180 bp, or 200 bp, or 220 bp, or whatever, and yet after I join the reads together, the resulting amplicon is always 250 bp long, because the PCR primers made a PCR product that was 250 bp long!
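To make the point about trimming concrete, here is a toy sketch, not how any real read-joining tool works. It assumes an error-free, randomly generated 250 bp amplicon and a reverse read that has already been reverse-complemented; `merge_pair` is a hypothetical helper written for this illustration.

```python
import random

# Toy 250 bp "amplicon" (random bases, purely illustrative).
random.seed(0)
amplicon = "".join(random.choice("ACGT") for _ in range(250))


def merge_pair(fwd: str, rev: str) -> str:
    """Join two reads on their longest exact suffix/prefix overlap.

    Assumes `rev` is already in forward orientation and error-free.
    """
    for k in range(min(len(fwd), len(rev)), 0, -1):
        if fwd.endswith(rev[:k]):
            return fwd + rev[k:]
    raise ValueError("reads do not overlap")


merged_lengths = []
for trim_len in (180, 200, 220):
    fwd = amplicon[:trim_len]    # forward read, trimmed at the 3' end
    rev = amplicon[-trim_len:]   # reverse read, shown in forward orientation
    merged = merge_pair(fwd, rev)
    merged_lengths.append(len(merged))
    print(trim_len, len(merged), merged == amplicon)
```

Whatever the trim length, only the size of the overlap changes; the joined sequence is the full PCR product each time.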

convention due to uniform read length

Yes, this is a convention. But it's due to the uniform length of the PCR products we are sequencing.

I'm not sure I'm explaining this very well...
Let me know if you have more questions!


Unlike genome sequencing, each read in 16S data originates from potentially different species present in the community. In other words, the non-chimeric read count includes reads from many taxa.

So my updated question is:

  • Why is the total non-chimeric read count (across all taxa) still referred to as sequencing depth in 16S analysis?

  • Is this simply a practical proxy (i.e., usable sequencing effort per sample), or is there a more formal reasoning for treating this pooled count as “depth”?

  • Are there guidelines or references that explicitly address this distinction between genome coverage and amplicon read depth across mixed communities?

Thank you very much!

Hi @gy.park,

Just following up on @colinbrislawn's excellent answer.

Yes, the total number of non-chimeric reads (the number given in the DADA2 statistics) does include all the species in the sample. The count for each species is produced after the taxonomic classification of each read (using the sklearn classifier plugin, for example).

On sequencing depth, I use a more generic definition:
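As a small illustration of how the pooled count relates to the per-taxon counts, here is a sketch with a made-up feature table. The sample names, taxa, and numbers are all hypothetical and do not come from any real DADA2 output.

```python
# Hypothetical feature table after taxonomic classification:
# sample -> {taxon: number of non-chimeric reads assigned to it}
feature_table = {
    "sample_A": {"Bacteroides": 5200, "Faecalibacterium": 3100, "Escherichia": 700},
    "sample_B": {"Bacteroides": 1500, "Faecalibacterium": 400, "Escherichia": 100},
}

# "Depth" per sample is the pooled read count across all taxa.
depth_per_sample = {
    sample: sum(counts.values()) for sample, counts in feature_table.items()
}

# Per-taxon counts only become comparable between samples
# once they are expressed relative to that depth.
relative_abundance = {
    sample: {taxon: n / depth_per_sample[sample] for taxon, n in counts.items()}
    for sample, counts in feature_table.items()
}

print(depth_per_sample)
print(relative_abundance["sample_B"]["Bacteroides"])
```

Here sample_A has more Bacteroides reads than sample_B in absolute terms, but a smaller share of its total, which is why the pooled count is tracked per sample.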

Expected average sequencing depth = Total sequenced bases / length of the target

The length of the target depends on what kind of experiment you are dealing with.

If you are working on a whole-genome sequence, it is indeed the length of the genome of the species. But keep in mind that there could still be regions with low or no coverage due to biases in the library preparation.

If the experiment is exome sequencing, in which you use baits to capture only the known exonic regions of a species, the length of the target is derived from the total length of the baits, not the length of the genome.

In an amplicon sequencing experiment, the length is given by the amplified region, which acts as the target. In this experiment each non-chimeric sequence covers the whole amplified region, hence the whole target. After removing the PCR primers, all sequences are the same length, which is equal to the length of the amplicon.
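The arithmetic behind this can be sketched as follows. The function name and all numbers are illustrative assumptions, not values from a real run.

```python
def expected_depth(total_bases: int, target_length: int) -> float:
    """Expected average depth = total sequenced bases / length of the target."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    return total_bases / target_length


# Amplicon case: every non-chimeric sequence spans the whole target,
# so total bases = read count * amplicon length.
amplicon_length = 250          # bp, illustrative
non_chimeric_reads = 40_000    # illustrative
amplicon_depth = expected_depth(non_chimeric_reads * amplicon_length, amplicon_length)

# Whole-genome case for contrast: reads are much shorter than the target.
genome_length = 5_000_000      # bp, illustrative
wgs_reads = 1_000_000          # illustrative
read_length = 150              # bp, illustrative
wgs_depth = expected_depth(wgs_reads * read_length, genome_length)

print(amplicon_depth)  # the amplicon length cancels, leaving the read count
print(wgs_depth)
```

Because the amplicon length appears in both the numerator and the denominator, the generic definition reduces to the number of non-chimeric sequences.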

I usually use the non-chimeric sequence count for each sample to get an idea of whether there are enough sequences to identify low-abundance species (though this is better seen in a rarefaction plot), or whether the library prep went well: for example, whether there are too many low-quality or chimeric sequences, or whether the positive and negative control samples behave as expected.

I hope this clarifies things a bit.
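For intuition on why read count matters for low-abundance species, here is a minimal rarefaction sketch. It is not the implementation used by QIIME 2 or any other tool; `rarefaction_curve`, the taxon names, and the counts are all hypothetical.

```python
import random


def rarefaction_curve(counts, depths, seed=0):
    """Subsample reads without replacement and count the distinct taxa seen."""
    rng = random.Random(seed)
    # Expand the count table into one label per read.
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    curve = []
    for depth in depths:
        if depth > len(reads):
            break  # cannot subsample more reads than the sample contains
        subsample = rng.sample(reads, depth)
        curve.append((depth, len(set(subsample))))
    return curve


# Made-up sample with one dominant taxon and one very rare taxon (1000 reads total).
counts = {"taxon_1": 900, "taxon_2": 90, "taxon_3": 9, "taxon_4": 1}
curve = rarefaction_curve(counts, depths=[10, 100, 500, 1000])
for depth, observed in curve:
    print(depth, observed)
```

At shallow depths the rare taxa are often missed; only when all 1000 reads are used is every taxon guaranteed to be observed.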

Luca


Maybe this is helpful? From a guide on RNA-seq (transcriptomic sequencing).

This is for untargeted sequencing of the exome, in contrast to targeted sequencing after PCR. Depth means the same thing.

Coverage is less often discussed for amplicon studies, because the PCR products always target the same region so coverage should be 100%. Depth is still just total read number.

Coverage CAN apply to amplicons too!

I think about coverage for amplicons during database search.

If half the length of an ASV is a 100% match to the database, this database hit will be discarded because 'coverage' is too low. Coverage still refers to how much of a sequence is covered, but here it's coverage of an ASV during database search, not coverage of a genome.
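A rough sketch of that kind of filter is below. The function names and the identity and coverage thresholds are assumptions chosen for illustration; they are not the defaults of any particular search tool.

```python
def query_coverage(aligned_length: int, query_length: int) -> float:
    """Fraction of the query (the ASV) that is covered by the alignment."""
    return aligned_length / query_length


def keep_hit(percent_identity: float, aligned_length: int, query_length: int,
             min_identity: float = 97.0, min_coverage: float = 0.8) -> bool:
    """Keep a database hit only if both identity and coverage are high enough.

    The thresholds are hypothetical and exist only for this example.
    """
    coverage = query_coverage(aligned_length, query_length)
    return percent_identity >= min_identity and coverage >= min_coverage


asv_length = 250  # bp, illustrative

# A perfect match over only half of the ASV is rejected on coverage...
half_hit = keep_hit(percent_identity=100.0, aligned_length=125, query_length=asv_length)

# ...while a slightly imperfect match over the full length is kept.
full_hit = keep_hit(percent_identity=99.0, aligned_length=250, query_length=asv_length)

print(half_hit, full_hit)
```

In this sense coverage is a property of one alignment, while depth remains a count of reads.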

The number of copies of an ASV is still 'depth', which is just the number of reads.
