Denoising using DADA2

Hello!
I am stuck on one thing. I am using QIIME 2 for my 16S analysis. I am trying to filter reads in the denoising step, and I am getting a representative sequence set that I am not able to understand. Some stats from the denoising step performed with DADA2 are in the table below:

| Trunc-Len | Reads | Non-Chimeric Sequences |
|---|---|---|
| 0 | 420355 | 1946 |
| 40 | 52320 | 1308 |
| 100 | 455600 | 4556 |
| 200 | 104200 | 3521 |
| 300 | 2400 | 8 |

As I understand it, it is cutting off the bases beyond the given trunc length.

What I don’t understand is why it also discards the reads that are shorter than the given trunc length. It only keeps the reads that are longer than the trunc length provided, and cuts off their remaining bases.
I also do not understand why the representative sequences are exactly as long as the trunc length. Whatever trunc length is given, the representative set ends up at exactly that length.

I don't understand why this is happening. What can the consequences of this be for assigning taxonomy, especially with de novo-based methods?

Please help me learn and understand the parameter so that I can proceed with a thorough understanding and analyse my data correctly.

Thanks to all of you in advance.

Best Regards,
Rahul

Welcome to the forum @rahul!

Yes, you are truncating the reads at a specified length. This is done to remove any low-quality nucleotides present after that base position.

You could also disable this truncation by setting the value to 0, as you have done above. But this is discouraged unless you have very high-quality reads.

You could also truncate based on Q-score with the --p-trunc-q parameter, instead of truncating by length. This will truncate reads where Q score drops below some threshold, but not truncate or filter by length.
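To make the difference from length truncation concrete, here is a toy Python sketch (not DADA2's actual code). It assumes the rule described in the DADA2 documentation, that a read is cut at the first base whose quality score is less than or equal to the threshold. The function name and the example reads are made up for illustration.

```python
def truncate_at_quality(seq, quals, trunc_q):
    """Cut a read at the first base whose quality score is <= trunc_q.

    Assumption: this mirrors the documented trunc-q rule; it is only a sketch.
    """
    for i, q in enumerate(quals):
        if q <= trunc_q:
            # Keep only the bases before the first low-quality position.
            return seq[:i], quals[:i]
    # No base fell to or below the threshold: the read is kept whole.
    return seq, quals


# Two hypothetical reads of equal starting length.
read_a = ("ACGTACGT", [38, 37, 36, 30, 20, 35, 12, 8])
read_b = ("ACGTACGT", [38, 38, 38, 38, 38, 38, 38, 38])

trimmed_a, _ = truncate_at_quality(*read_a, trunc_q=20)
trimmed_b, _ = truncate_at_quality(*read_b, trunc_q=20)

print(trimmed_a, len(trimmed_a))
print(trimmed_b, len(trimmed_b))
```

Note that the two reads come out at different lengths, which is the main practical difference from truncating at a fixed position.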

dada2 assumes that you have not done any preliminary trimming or processing of the reads. To operate, dada2 needs the quality scores at each base position across a subset of sequences to train its error model. If some of these reads are shorter than others, it cannot build this model.

So reads shorter than the truncation length are filtered out: this removes rubbish reads that are too short to use.

The representative sequences are exactly the trunc length because you are actually truncating your reads at that specified length.
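As a rough illustration of the two behaviours described above (shorter reads are dropped, longer reads are cut to the same length), here is a minimal Python sketch. It is a toy model of the filtering step only, not DADA2's implementation, and the function name and reads are hypothetical.

```python
def truncate_reads(reads, trunc_len):
    """Toy model of length truncation.

    Reads shorter than trunc_len are discarded; the rest are cut to
    exactly trunc_len bases. trunc_len == 0 disables truncation.
    """
    if trunc_len == 0:
        return list(reads)
    return [r[:trunc_len] for r in reads if len(r) >= trunc_len]


# Hypothetical reads of varying length.
reads = ["ACGTACGTAC", "ACGTAC", "ACGTACGTACGT", "ACG"]

kept = truncate_reads(reads, 8)
print(kept)                      # only the two reads with >= 8 bases survive
print({len(r) for r in kept})    # every surviving read has the same length
```

With a trunc length of 8, only the reads with at least 8 bases survive, and all of them come out at exactly 8 bases. With a trunc length of 0, all four reads pass through at their original, varying lengths.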

The consequences for taxonomy assignment depend on how much you truncate and what marker gene you are targeting. Truncating a little bit probably does not matter much, but a 50 nt sequence will most likely yield less specific information (i.e., you will get shallower taxonomic classifications) than a 200 nt sequence from the same region.

For more detail, see:
- the help documentation: qiime dada2 denoise-paired --help
- qiime2.org
- the dada2 documentation (this is for running dada2 in R, not in QIIME 2, but has FAQs and more details on what dada2 does and why): https://benjjneb.github.io/dada2/index.html

Thank you so much, Nicholas. Now I understand why I was getting so few reads after denoising. Thank you for sharing your knowledge.
One last thing: can you please tell me whether there is a way to increase the number of reads after denoising with a cut-off length of 200 or 250? The sequences I will be using are around 250-300 bp long.

Thanking You.

Best Regards,
Rahul Yadav

Hello Nicholas,
First of all I thank you for guiding me about the denoising step using DADA2.

But I am stuck again at the same step. When I set --p-trunc-len to 0 (zero), I get representative sequences, and OTUs, of varying length. The denoising outputs needed for the clustering step do not work when they are produced with --p-trunc-len 0, but they do work with --p-trunc-len 200 or any other non-zero value.
I also read the DADA2 paper, which says that the tool needs sequences of the same length in order to denoise, and that denoising fails without that. When denoising with --p-trunc-len 0, I get representative sequences of varying length, which makes me think that no denoising is happening in this case.

I also tried keeping --p-trunc-len at 0 (zero) and --p-trunc-q at 20, but I am not able to proceed with the further steps. However, when I set --p-trunc-len to anything other than 0 and --p-trunc-q to 20, clustering works.

Please explain the reason behind this, how I should proceed with the analysis, and how to deal with these kinds of errors.

Thanking You in advance.

Best Regards,
Rahul

Well, yes, that is half of the reason to use truncation. I had assumed that you were using paired-end reads, in which case the variable length should not be a big issue, since you are joining the reads into full amplicons. It sounds like you may be using single-end reads, in which case you should definitely truncate at a specific length.

Sounds like you have already solved this: you need to truncate at a specific length so that you do not get ASVs of variable length gumming up your gears. The decision-making process for choosing a truncation length has been discussed numerous times on this forum (use the search function to browse) and in the tutorials on qiime2.org.
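One part of that decision can be sketched in a few lines of Python: how many reads survive the length filter at each candidate truncation length. This is only a back-of-the-envelope helper under stated assumptions. The read lengths below are invented, the function name is hypothetical, and it counts the length filter alone, ignoring the reads lost later to quality filtering, denoising, and chimera removal.

```python
def fraction_retained(read_lengths, trunc_len):
    """Fraction of reads long enough to survive truncation at trunc_len.

    trunc_len == 0 means no truncation, so every read passes this filter.
    """
    if not read_lengths:
        return 0.0
    if trunc_len == 0:
        return 1.0
    kept = sum(1 for n in read_lengths if n >= trunc_len)
    return kept / len(read_lengths)


# Hypothetical read lengths (in nt); in practice, take these from your own data.
read_lengths = [251, 251, 250, 248, 230, 201, 180, 150, 120, 90]

for trunc_len in (100, 200, 250):
    print(trunc_len, fraction_retained(read_lengths, trunc_len))
```

In this made-up data set, raising the truncation length from 100 to 250 drops the surviving fraction from 0.9 to 0.3. That is the trade-off to weigh against the extra taxonomic information that longer sequences carry.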