Sequence Length Statistics - problem

lisacarraro1982 · February 12, 2019, 5:25pm

Hi,

I have 300 bp paired-end reads and I expect, after merging, sequences with a median length of 467 bp.

I ran DADA2 using this command:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs 16Sdemux-paired-end_A.qza \
  --o-table table_A \
  --o-representative-sequences rep-seqs_A \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 280 \
  --p-trunc-len-r 240 \
  --o-denoising-stats stats-A.qza

In the rep-seqs.qzv file I find these statistics:
min length 299
max length 469

Why do I find sequences of different lengths?

What is the problem?

Thank you very much

Lisa

Nicholas_Bokulich · February 13, 2019, 3:57am

@timanix's response was correct; the issue is that these sequences are being joined by dada2, and the joined ASVs vary in length. There are a few reasons why these lengths will differ from the 467 bp median length you expect:

Natural length variation. I am not sure what gene you are sequencing and how much it varies in length. 16S rRNA gene regions usually do not vary much in length, but they do vary somewhat (probably not as much as you are observing, though). Other marker genes, such as ITS, vary to a much larger degree.
Bad read joining. dada2 requires a minimum of 20 nt overlap, but I do not know offhand how many mismatches are tolerated. You could be getting bad joining, leading to shorter sequences than expected. This seems a little unlikely, though, as the shorter sequences would imply a much longer region of overlap (~220 nt!)
Non-target DNA.
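
To make the overlap figure in the second point concrete, here is a minimal sketch of the arithmetic. It assumes the truncation lengths from the command in the original post (280 and 240 nt, with no left trimming) and that each merged sequence spans both truncated reads end to end; `implied_overlap` is a hypothetical helper name, not part of QIIME 2 or dada2.

```shell
# Truncation lengths taken from the denoise-paired command in the original post.
trunc_f=280
trunc_r=240

# Implied overlap = combined truncated read length - merged sequence length.
# (Assumes the merged sequence covers both truncated reads completely.)
implied_overlap() {
  echo $(( trunc_f + trunc_r - $1 ))
}

# Merged lengths reported in the thread: min 299, expected 467, max 469.
for merged_len in 299 467 469; do
  echo "merged ${merged_len} nt -> overlap $(implied_overlap "$merged_len") nt"
done
```

Under these assumptions the 299 nt sequence implies an overlap of 221 nt, while the expected 467 nt and the 469 nt maximum imply overlaps of 53 nt and 51 nt.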

Either way, I recommend exporting and manually checking the short sequences to see what they are.
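
One possible way to do that check is sketched below. It assumes the representative sequences were saved as rep-seqs_A.qza and that `qiime tools export` writes them to a FASTA file named dna-sequences.fasta in the output directory; the directory name, the `short_seqs` helper, and the 400 nt cutoff are illustrative choices only.

```shell
# Export step (requires QIIME 2; shown as a comment so the filter below runs on its own):
#   qiime tools export --input-path rep-seqs_A.qza --output-path exported-rep-seqs
# The export is assumed to contain a FASTA file named dna-sequences.fasta.

# Print "id<TAB>length" for every sequence shorter than a cutoff.
# Sequence lines are summed per record, so wrapped FASTA files are handled too.
short_seqs() {
  fasta=$1
  cutoff=$2
  awk -v cutoff="$cutoff" '
    function report() { if (id != "" && len < cutoff) printf "%s\t%d\n", id, len }
    /^>/ { report(); id = substr($1, 2); len = 0; next }
    { len += length($0) }
    END { report() }
  ' "$fasta"
}

# Example call (the 400 nt cutoff is an arbitrary choice):
#   short_seqs exported-rep-seqs/dna-sequences.fasta 400
```

The IDs it lists can then be pulled from the FASTA file and compared against a reference database (for example with BLAST) to see whether the short sequences are non-target DNA.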

Good luck!

lisacarraro1982 · February 14, 2019, 11:46am

Thank you very much!
