Expected sequence length range after denoising paired-end sequences?

Hello,

I am new to microbiome bioinformatics and have a query about the range of sequence lengths I am getting in my denoised paired-end sequences. I have searched the forum and the excellent QIIME2 tutorials but can't seem to find an explanation that answers my query.

I am running QIIME 2 version 2022.2.0 installed with conda.

I have Casava 1.8 paired-end demultiplexed fastq files which I have imported successfully, and based on the demux summarize results have truncated as follows using DADA2:

qiime dada2 denoise-paired
--i-demultiplexed-seqs demux-paired-end.qza
--p-trunc-len-f 250
--p-trim-left-f 13
--p-trunc-len-r 215
--p-trim-left-r 24
--o-representative-sequences rep-seqs.qza
--o-table table.qza
--o-denoising-stats stats.qza

The original demultiplexed sequence length summary shows both the forward and reverse reads have 251 nts, and the data is of the V4 region using 515F-806R primers. My question is based on the sequence length statistics from the rep-seqs.qzv file (screenshots below).



This seems like quite a large relative range, but I am unsure if this is normal? I am also concerned that the minimum length of 237 is the same as the truncated length of my forward reads (251-1-13=237), does this mean some reads did not merge? Or that they did but the shorter reverse reads (251-24-36=191) completely overlapped the longer forward reads? I apologise if this doesn't make sense, I am still getting to grips with the processes!

Thank you in advance for any guidance on this!

@Ash,

First off, I would use metadata tabulate (DOCS) on your stats.qza file to generate a visualization that will allow you to view statistics about your denoising run, so that rather than guessing about what happened you should be able to see if they were merged or not. Since you are trimming 24 bps from your reverse reads, it is possible that they are completely overlapped.

I am not sure that you need to be trimming so aggressively(or at all) on the 5' prime ends of your reads. Even though they are showing as being "lower" quality than the following positions, these early positions still have a low error rate. The equation for calculating Q is Q = -10 \log_{10}(P), with P being the probability of error at that position. So for a score of 32, the probability of error at the position is: \mathrm{10}^{\frac{32}{-10}} \approx 0.00063, so it might be worth running the denoising stat again without trimming at the front of each read to see if you can keep more of each sequence.

Hope this helps!

1 Like

@Keegan-Evans, thanks very much for your reply and suggestions. I checked the stats visualisation and think the merging worked ok (the percentage of input merged was ~65% and upwards).
I ran the denoising step again without trimming at the front, but trimming a bit more of the reverse reads:

qiime dada2 denoise-paired
--i-demultiplexed-seqs demux-paired-end.qza
--p-trunc-len-f 250
--p-trunc-len-r 162
--o-representative-sequences rep-seqs.qza
--o-table table.qza
--o-denoising-stats stats.qza

This seems to have fixed the merge length issue as the sequence length statistics now show a mean length of 253, which I believe is in the expected range for the V4 amplicon?


Thanks again!

1 Like

@Ash awesome! This looks better, you are right that 253 is right in the middle of the expected range.

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.