Sequence length statistics

Hello,

I have a question about which read lengths to accept and which to filter out.

The data i have is 4V 2 x 250 paired-end Novaseq (so a lot of reads).
The overlap should be about 210 bp and total merged read length after denoising (dada2 denoise-paired) should be about 250 bp excluding the 19 bp forward and 20 bp reverse primers (F 515 R 806).

I gave this command:
qiime dada2 denoise-paired --i-demultiplexed-seqs WesternDietSertpaired-end-demux.qza --p-trim-left-f 20 --p-trim-left-r 20 --p-trunc-len-f 0 --p-trunc-len-r 0 --o-table tableWesternDietSert.qza --o-representative-sequences rep-seqs-WesternDietSert.qza --o-denoising-stats denoising-statsWesternDietSert.qza --verbose

It gives me a large range of read lenghts 69-450 with a mean of 239


Now which reads should i filter out? i could use only the 25-75% interval 247-252 read length? This way i loose 50% of my data but i take only the closest match.

Something that puzzles me is that i had previously mistakenly truncated 20 bp forward and reverse besides trimming 20 bp forward and reverse. In this setting, despite cutting 20 bp of biological signal, my read statistics look more consistent with the range being 210-408 and the 9-75% interval being 249-252. I would have thought that giving more of the reads would improve read statistics. --p-trim-left-f 20 --p-trim-left-r 20 does mean i cut the 20 bp at the beginning of the forward and reverse reads right?
and --p-trunc-len-f 0 --p-trunc-len-r 0 is intended for cutting the end of the sequence, that i have now opted out of.

With only trimming 20 bp the beginning i now have many more features 4040 vs 2814, so that seems to have improved…


Note that the q scores look good so (above 30) so i don’t have to cut anything at the end. I did it because i thought there was a reverse primer at the end of the read but there is not.

Thanks for your advice on filtering out read lengths.

bests Mirjam

Hi @MirjamBloem! Thanks for the detailed post, this is really helpful.

Taking a step back, what did the raw read lengths look like? Can you share some of the demux summarize results with us? In particular the read length distribution table. I am also curious about the denoising stats from DADA2 - that would be really helpful here, too.

Will wait to hear back from you. Thanks!

Hi @thermokarst,

See below the demux summary information from the visualization (qzv)



Here the denoising information for the reads only trimmed for the first 20 bp:


And here denoising information for the reads when first and last 20 bp were trimmed:


image

image

Not sure this is what you were looking for. I also upload the denoising statistics .qzv’s
denoising-statsWesternDietSert230.qzv (1.2 MB) denoising-statsWesternDietSertjust20.qzv (1.2 MB)
Thanks for your help.
Bests Mirjam

1 Like

Regarding the above question and unexpected variation in bp length, I wonder if these sequences should be taken care of e.g. by removing them. Is this a common thing to do QIIME2?

Thanks @MirjamBloem - I am going to recategorize this as “General Discussion” - I misunderstood the initial post when I first replied and I thought that you were experiencing a problem, but it actually looks like you’re on the right track, and just need to spend some time getting up to speed on DADA2 - please let me know if I misunderstood. Thanks!

Hi @Joanna, it depends on the target gene and primer set used, but I am not sure I’d describe this as a common scenario. It is certainly possible to filter sequences by length, though! See here:

Hi,

Indeed i used this "qiime feature-table filter-seqs \ --i-data seqs.qza \ --m-metadata-file seqs.qza \ --p-where ‘length(sequence) > 4’ \ --o-filtered-data filtered-seqs.qza " option to filter the sequences.

Though i wonder

  1. how it can be that when (mistakely) eliminating the last 20 bp using the ‘trun’ setting i have more consistent read lengths compared to not cutting the last 20 bp and

  2. i wonder if where to put the cut-off for trimming the reads? E.g. if you expect a read length of 250 bp, should you filter out anything below 200 and above 300? This way i would filter out 50% of the reads…i.e. imo really too much.
    It may also be that very short/long reads will not be classified and filtered out after taxonomy? This would make it not necessary to filter the reads after denoising based on the read length statistics.

Bests Mirjam

@Nicholas_Bokulich thank you for your reply. So, in this particular example V4 region is used, so the expected length would be around 250bp. The percentile (see the picture at the top of this discussion or below this comment) indicates a bigger range than expected. I wonder what would you do in this situation? Would you leave as it is or it is better to filter them out based on the expected length?