DADA2 paired-end (EMP): All sequences of same length

ChrisKeefe · June 4, 2019, 12:19am

Hello forum friends!
I've encountered unusual results applying DADA2 to a paired-end Illumina sequencing run (2x251). With very permissive trim/trunc parameters, all sequences returned by DADA2 were of identical length (246 nt).

We began with exceptional quality scores and sequencing depth. Denoising stats and downstream analysis looked good, but the perfectly-uniform sequence lengths smelled fishy. We suspected read-joining might not be happening, and re-ran DADA2 with much tighter trim parameters. This appears to have resolved the issue: reads are now of variable lengths within an expected range, and incidentally, more sequences were preserved than in the initial run.

This seems to indicate that reads were initially not joining, but it is unclear why this was happening. If any of you (@benjjneb) have any insight you can share, I'd be much obliged.
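For anyone wanting to repeat the length check, here is a minimal sketch of how the sequence-length distribution could be tallied. It assumes the representative sequences have been exported to a plain FASTA file (for example with `qiime tools export`); the function name `length_distribution` and the file path in the comment are illustrative, not part of QIIME 2 or DADA2.

```python
from collections import Counter


def length_distribution(fasta_lines):
    """Tally sequence lengths from an iterable of FASTA lines."""
    lengths = Counter()
    current = None  # length of the record being read; None before the first header
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths[current] += 1
            current = 0
        elif current is not None:
            # Sequence lines may be wrapped, so accumulate across lines.
            current += len(line)
    if current is not None:
        lengths[current] += 1
    return lengths


# Hypothetical usage on an exported file (path is an assumption):
# with open("exported/dna-sequences.fasta") as handle:
#     print(length_distribution(handle))
```

A result with a single key (here, 246) would correspond to the uniform-length output described above.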

Parameters first, artifacts attached below so you can explore provenance.

Identical-length reads:

qiime dada2 denoise-paired \
  --p-n-threads 0 \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 5 \
  --p-trim-left-r 5 \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 203 \
  --o-table paired-data/paired-table.qza \
  --o-representative-sequences paired-data/paired-rep-seqs.qza \
  --o-denoising-stats paired-data/paired-denoising-stats.qza

"Normal" behavior:

qiime dada2 denoise-paired \
  --p-n-threads 16 \
  --i-demultiplexed-seqs demux/demux.qza \
  --p-trim-left-f 5 \
  --p-trim-left-r 5 \
  --p-trunc-len-f 180 \
  --p-trunc-len-r 90 \
  --o-table paired-data/DADA2/paired-table.qza \
  --o-representative-sequences paired-data/DADA2/paired-rep-seqs.qza \
  --o-denoising-stats paired-data/DADA2/paired-denoising-stats.qza

Notes:

I tested whether the default --p-trunc-len-f 0 parameter is working as expected by running side-by-side tests of --p-trunc-len-f 0 and --p-trunc-len-f 251. No unusual difference was noted.
These tests were run on both v18.11 and v19.4, and no unusual difference was noted. I don't think this is related to the recent update to DADA2 v1.10.

denoising-stats.qzv (1.2 MB)
denoising-stats-corrected.qzv (1.2 MB)

Thanks so much!
Chris

benjjneb · June 4, 2019, 12:41am

ChrisKeefe: "With very permissive trim/trunc parameters, all sequences returned by DADA2 were of identical length (246 nt)."

I love a good mystery, and I've never seen this one before! Agreed that this should set off alarm bells: while the V4 length distribution is tight, it isn't perfectly uniform.

But on that note, can you tell us more about the amplicon setup? Am I right in assuming this is V4? What primers? What library setup? Were primers included in the sequenced amplicons?

ChrisKeefe · June 6, 2019, 7:48pm

Yes, V4, with the standard EMP primers (515f and 806r). We attach primers, perform a QC step (gel), then quantify and pool amplicons at equal concentrations. Methods are similar to Cope et al. Primers are not included in the sequenced amplicons (I've confirmed this using q2-cutadapt).

I'm not experienced with the wetlab processes at play here, so please let me know if I've overlooked some information you need.

Thanks!
Chris

benjjneb · June 6, 2019, 8:15pm

Ah I see what happened here.

The way you trimmed your reads initially resulted in the reverse read always falling completely within the forward read. That is, the full amplicon length varies (tightly) between about 251 and 255 nt. But since you cut off the first 5 nt of the reverse read, it always started at a position within the forward read, and because it was truncated at 203 nt, it never extended past the other end of the forward read. So every merged read is just the length of the trimmed forward read (251 - 5 = 246 nt). When you truncated sooner in the second set of parameters, the forward read no longer extended past the start of the reverse read, and you got back some length variation.
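The geometry can be checked with a little arithmetic. The sketch below is a simplified model, not DADA2's actual merging code: it places both reads on amplicon coordinates, assumes 251-nt reads with no primers and no read-through, and treats a truncation length of 0 as "no truncation". The function names and the `min_overlap=12` default are assumptions made for illustration (12 is, to my knowledge, the documented default minimum overlap of DADA2's `mergePairs`).

```python
def read_spans(amplicon_len, trim_left_f, trunc_len_f,
               trim_left_r, trunc_len_r, read_len=251):
    """Return (fwd_start, fwd_end, rev_start, rev_end) in half-open amplicon coordinates."""
    # A truncation length of 0 means the full read is kept.
    f_len = trunc_len_f or read_len
    r_len = trunc_len_r or read_len
    # Forward read: starts at the amplicon start, minus the trimmed leading bases.
    fwd_start = trim_left_f
    fwd_end = min(f_len, amplicon_len)
    # Reverse read: starts at the amplicon end and runs backwards.
    rev_end = amplicon_len - trim_left_r
    rev_start = max(amplicon_len - r_len, 0)
    return fwd_start, fwd_end, rev_start, rev_end


def merged_length(amplicon_len, min_overlap=12, **params):
    """Length of the merged read in this model, or None if the overlap is too short."""
    fwd_start, fwd_end, rev_start, rev_end = read_spans(amplicon_len, **params)
    overlap = min(fwd_end, rev_end) - max(fwd_start, rev_start)
    if overlap < min_overlap:
        return None
    # The merged read spans from the leftmost start to the rightmost end.
    return max(fwd_end, rev_end) - min(fwd_start, rev_start)


first_run = dict(trim_left_f=5, trunc_len_f=0, trim_left_r=5, trunc_len_r=203)
second_run = dict(trim_left_f=5, trunc_len_f=180, trim_left_r=5, trunc_len_r=90)

for amplicon_len in range(251, 256):
    print(amplicon_len,
          merged_length(amplicon_len, **first_run),
          merged_length(amplicon_len, **second_run))
```

Under these assumptions, the first parameter set gives 246 nt for every amplicon length from 251 to 255, while the second gives 241 to 245 nt, i.e. the amplicon length minus the 10 trimmed bases. Setting both trim-left values to 0 in the first parameter set makes the modeled merged length equal to the amplicon length.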

Nothing is really wrong in either case, but what I would recommend is not trimming off the initial 5 nts, mostly because that will make it harder to merge with other datasets later on that start/end at these standard primer set locations.

ChrisKeefe · June 6, 2019, 10:31pm

Thanks so much, @benjjneb! This feels so obvious in retrospect, but we never quite put all the pieces together. Explains why we didn’t see any unusual results under the poorly-trimmed parameters, but still lost a number of sequences. Read joining was happening, just not in the normal fashion.

Your point about holding onto the leading 5nt for future meta-analysis is a good one, too!
Much obliged,
CK