Representative Sequences generated from dada2 denoising step question

Sorry if this has already been posted in the forum that I missed, but I had a question regarding the representative sequences that are generated from the dada2 denoising step. When I create a visualization of the rep-seqs.qza file, I noticed that the sequences are not all the same length. When I had performed this function on other datasets, I thought that the rep-seqs that were generated were all the same length. Is it normal for the representative sequences to have variation in length, or is there something that I performed incorrectly?

The code I used is listed below:
qiime dada2 denoise-paired
--i-demultiplexed-seqs my-demux-file.qza
--p-trim-left-f 6
--p-trim-left-r 6
--p-trim-left-f 150
--p-trim-left-r 149
--p-n-threads 0
--o-table my-table.qza
--o-representative-sequences my-rep-seqs.qza
--o-denoising-stats my-denoising-stats.qza

I am performing this on Illumina Miseq 2 x 151nt run on the 16s region, that was imported through the Earth Microbiome Project format for more information.

Length depends on the sequences themselves. In biological terms, your primers bind to some conserved areas on either end of your target amplicon. The nucleotide sequences between them are variable, and an indel event could give even very closely related strains of a given microbe different amplicon lengths.

Different regions will show different levels of length variability. Fungal ITS is highly variable, while some 16s regions usually vary by only a few nt.

Is your code copy-pasted, or hand-typed? You seem to have four --p-trim-left parameters, which is probably not right.

Bonus points:
If the first 6 nt of your 5' sequence aren't non-biological or of bad quality (just worse than the next bunch), consider not trimming those positions. If your sequences start and end at a "meaningful" location (like the beginning and end of a known region, it may be easier to use your data in meta-analyses later.

2 Likes

The code was hand typed, the second two --p-trim-left parameters should have read as --p-trunc-len-f/r

Your answer to my question makes sense, I had misinterpreted the "Demultiplexed sequence length summary" from the demux.qzv file. For some reason I thought that 100% of my sequences were 151 nt length, when that summary does not include that information necessarily.

For the bonus points, I will try not trimming those first 6 nt (they were just slightly worse than the rest) and see how that looks!
Thanks for the help!

I have always assumed that we can trust the sequencer to reliably produce e.g. 151-nt reads. I'm not actually sure that's the case 100% of the time, but I doubt that's the factor here. Speaking in broad strokes, if you were working with single-end sequences, your truncation length would dictate your sequence length.

With paired-end reads, on the other hand, DADA2 joins reads based not on where you trimmed, but on where the two sequences overlap/align perfectly for at least 12 base pairs (see the paper for details). Indel events on either side of that overlap region will be reflected in the length of the joined sequence. Try adding or subtracting nucleotides from the ----- sections of these "joined reads" while keeping the overlap nucleotides correctly aligned, and watch what happens with the overall length.

|-------------ACCGTCAAA>
             <ACCGTCAAA-------------|
2 Likes