Analysis with DADA2... Merge Read Length

Is length‑based filtering defensible?

I am working with paired‑end 300 bp Illumina reads targeting the V3–V4 region.

For filtering, I set truncation lengths based on the quality plots (forward reads truncated to 260 bp, reverse to 240 bp). Error learning for both directions looked clean and the subsequent merging step was efficient, so on the surface there were no obvious problems with overlap or read quality.
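
As a rough check on the overlap reasoning, the expected overlap is simply the two truncation lengths added together minus the merged length. The sketch below uses the truncation values from the post; the merged lengths are illustrative assumptions (the ~291 bp peak and the ends of a ~350–480 bp window), and it relies on `mergePairs()` requiring `minOverlap = 12` bases by default.

```r
# Expected overlap between truncated forward and reverse reads.
# trunc_f and trunc_r are the truncLen values from the post; the merged
# lengths are illustrative assumptions, not measurements.
trunc_f <- 260
trunc_r <- 240
merged_len <- c(short_peak = 291, window_min = 350, window_max = 480)

# Bases shared by the two reads once they are laid over the amplicon.
overlap <- trunc_f + trunc_r - merged_len
print(overlap)

# DADA2's mergePairs() needs at least minOverlap bases (12 by default).
min_overlap <- 12
print(overlap >= min_overlap)
```

Under these assumptions every length class clears the default minimum overlap, which fits the observation that merging was efficient across the board.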

The issue appears when I look at the merged ASV lengths. Using nchar(getSequences(seqtab)) to summarize lengths, I don’t see the tight, single peak I would expect around the typical V3–V4 amplicon size. Instead, there is a very strong mode around ~291 bp. Given the truncation settings and good merge performance, I don’t think this is just an overlap failure artifact. To dig into this, I took several abundant ASVs from the ~291 bp class and ran BLAST searches against the nucleotide database. The top hits were to mammalian nuclear/lncRNA loci rather than to bacterial 16S rRNA genes, with good identity and E‑values but clearly mapping to host genomic regions. That makes me suspect that this short, high‑frequency 291 bp peak is off‑target host amplification rather than true 16S V3–V4 product, consistent with what can happen when non‑bacterial templates are present and the primers find alternative binding sites. I am working with low-biomass samples, so this seems logical to me.
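
For reference, a minimal sketch of how the length distribution can be summarized both by ASV count and by read count. It uses a toy matrix in place of the real `seqtab` (the lengths and abundances are made up for illustration), and it reads the sequences from the column names, which is where a DADA2 sequence table stores them.

```r
# Toy stand-in for a DADA2 sequence table: rows are samples, columns are ASVs,
# and the column names are the ASV sequences themselves.
# make_seq() is a hypothetical helper that builds a random sequence of length n.
make_seq <- function(n, seed) {
  set.seed(seed)
  paste(sample(c("A", "C", "G", "T"), n, replace = TRUE), collapse = "")
}
lens <- c(291, 291, 402, 427, 428)  # hypothetical merged lengths
seqs <- mapply(make_seq, lens, seq_along(lens))
seqtab <- matrix(c(500, 300, 40,  90, 70,
                   450, 350, 10, 120, 70),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("sampleA", "sampleB"), seqs))

asv_len <- nchar(colnames(seqtab))

# Number of distinct ASVs at each merged length.
asvs_per_length <- table(asv_len)
print(asvs_per_length)

# Total reads at each merged length, which shows how dominant a peak really is.
reads_per_length <- tapply(colSums(seqtab), asv_len, sum)
print(reads_per_length)
print(round(reads_per_length / sum(seqtab), 3))
```

Weighting by reads rather than counting ASVs makes it easier to see how much of the dataset sits in the short peak.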

What I’m now trying to decide is the most defensible way to handle this before moving on to ecology/diversity analyses. I’ve seen suggestions to filter ASVs by merged length for this amplicon, for example retaining only sequences within a plausible V3–V4 window (something along the lines of ~350–480 bp) and discarding shorter or longer sequences that likely represent non‑target amplification.
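
A minimal sketch of that kind of length filter, again on a toy matrix rather than the real `seqtab`. The 350–480 bp window is the one mentioned above and is an assumption to adjust for the primer pair in use; the toy lengths and counts are made up.

```r
# Toy sequence table: rows are samples, columns are ASVs named by sequence.
# make_seq() is a hypothetical helper that builds a random sequence of length n.
make_seq <- function(n, seed) {
  set.seed(seed)
  paste(sample(c("A", "C", "G", "T"), n, replace = TRUE), collapse = "")
}
lens <- c(291, 291, 402, 427, 428)  # hypothetical merged lengths
seqs <- mapply(make_seq, lens, seq_along(lens))
seqtab <- matrix(c(500, 300, 40,  90, 70,
                   450, 350, 10, 120, 70),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("sampleA", "sampleB"), seqs))

# Keep only ASVs whose merged length falls inside the assumed V3-V4 window.
keep <- nchar(colnames(seqtab)) %in% 350:480
seqtab_filt <- seqtab[, keep, drop = FALSE]

# Fraction of reads retained per sample after the length filter.
retained <- rowSums(seqtab_filt) / rowSums(seqtab)
print(dim(seqtab_filt))
print(round(retained, 3))
```

Reporting the per-sample fraction of reads retained alongside the filter makes the step easier to justify, particularly for low-biomass samples where the off-target fraction may be large.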

So my question is: is my interpretation of the dominant short‑length peak as off‑target (likely host‑derived) amplification sound, and is it defensible in this context to filter ASVs by merged‑sequence length to retain only plausible V3–V4 amplicons?

@ebakes, it sounds like you're running this with DADA2 directly, not through QIIME 2 - is that correct? Someone may jump in to help if that's the case, but we focus support efforts on QIIME 2 here as that's what we're primarily using.
