What factors to consider when denoising with dada2

ptalebic · May 2, 2020, 3:04pm

Hi,

I am going to use dada2 for the denoising step. I have imported my data, which were generated on the Illumina MiSeq sequencing platform (V4 region amplicon sequencing). My data are also single-end reads.

I was wondering what factors I need to consider when running dada2.

Here is the quality plot of my reads

Given this quality plot, what value should I choose for --p-trunc-len-f? Is 254 appropriate as the quality starts to drop at 255?

And since I have only forward reads, should I set p-trunc-len-r to 0, or do I not need to specify it at all?

Should I remove the first few bases?

To provide more information I need to mention that I obtained my data from ENA - European Nucleotide Archive using an accession number provided in a paper I read.

What is interesting is that after importing my data I noticed that the number of reads I have matches the number of reads obtained after quality control as stated in the paper. Therefore, I assume they performed quality control before submitting the data to ENA. If that is the case, what should I do? Should I skip the denoising step? If so, how can I create my feature table? Or should I run dada2 and pass 0 to the parameters?

Thank you again for creating this amazing forum.

Mehrbod_Estaki · May 2, 2020, 11:31pm

Hi @ptalebic,

Truncating at 254 sounds reasonable to me.

Since you have single-end reads, you'll want to use dada2 denoise-single instead, which has only one truncation parameter (--p-trunc-len), so there is no reverse-read value to set.
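For reference, here is a minimal sketch of what that call could look like, written as a small Python script that only assembles and prints the command. The artifact file names are hypothetical placeholders, and the truncation value is simply the one asked about above, not a recommendation.

```python
import shlex

# Hypothetical file names; substitute your own artifacts.
demux = "demux-single-end.qza"
trunc_len = 254  # the value asked about above; adjust to your own quality plot
trim_left = 0    # 0 keeps the first bases of every read

dada2_cmd = [
    "qiime", "dada2", "denoise-single",
    "--i-demultiplexed-seqs", demux,
    "--p-trunc-len", str(trunc_len),
    "--p-trim-left", str(trim_left),
    "--o-table", "table.qza",
    "--o-representative-sequences", "rep-seqs.qza",
    "--o-denoising-stats", "denoising-stats.qza",
]

# Only print the command for review; run it yourself in a QIIME 2 environment.
print(shlex.join(dada2_cmd))
```

The table.qza output is the feature table, so there is no separate step needed to create it.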

I don't think removing the first few bases is necessary; those have quite high quality.

What kind of denoising have they done? Ideally you would get the unfiltered FASTQ files, but if you can't, you could still run DADA2, though the error model may not be as accurate without the full run. A better alternative would probably be to denoise with Deblur, since it uses a static error model and so won't lose sensitivity as a result.
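If you go the Deblur route, the usual pattern is a quality-filtering step followed by denoising. The sketch below only builds and prints the two commands; the file names and the trim length of 240 are illustrative assumptions, and the helper qiime_cmd is a made-up convenience function, not part of QIIME 2.

```python
import shlex

def qiime_cmd(plugin, action, **options):
    """Build a QIIME 2 CLI argument list (hypothetical helper, not a QIIME 2 API)."""
    args = ["qiime", plugin, action]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            args.append(flag)          # boolean flags take no value
        else:
            args += [flag, str(value)]
    return args

# Step 1: basic quality filtering of the demultiplexed reads.
filter_cmd = qiime_cmd(
    "quality-filter", "q-score",
    i_demux="demux-single-end.qza",            # hypothetical file name
    o_filtered_sequences="demux-filtered.qza",
    o_filter_stats="demux-filter-stats.qza",
)

# Step 2: Deblur denoising; all reads are trimmed to one fixed length.
deblur_cmd = qiime_cmd(
    "deblur", "denoise-16S",
    i_demultiplexed_seqs="demux-filtered.qza",
    p_trim_length=240,                         # illustrative value only
    p_sample_stats=True,
    o_representative_sequences="rep-seqs-deblur.qza",
    o_table="table-deblur.qza",
    o_stats="deblur-stats.qza",
)

for cmd in (filter_cmd, deblur_cmd):
    print(shlex.join(cmd))
```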

No, passing 0 to the parameters is not going to be beneficial.

good luck!

ptalebic · May 3, 2020, 2:52pm

Thanks for your reply. There is no information on the denoising step provided in the paper.

The paper "Denoising the Denoisers: an independent evaluation of microbiome sequence error-correction approaches" suggests that DADA2 finds more ASVs and is better at finding rare organisms. If I also aim to identify rare sequence variants, would you still recommend using Deblur?

Thanks again for your reply.

ptalebic · May 3, 2020, 3:51pm

Given this plot, the quality drops at position 255 but goes up again after that. Should I be strict about truncating at this length? One thing to note is that multiple positions before 255 also have a quality of less than 20. I am a bit confused and I don't know where to truncate my sequences.

Mehrbod_Estaki · May 4, 2020, 3:58am

Hi @ptalebic,

Three of the four methods in the paper (DADA2, Deblur, UNOISE3) are what we refer to as denoisers. For Deblur specifics you can see the original paper here.
It's been a while since I read the comparison paper you linked, but I believe they used the default settings for Deblur, which removes any ASVs with fewer than 10 counts across the whole dataset. DADA2, on the other hand, removes only singletons by default. So this is one step where some rare taxa may be discarded in Deblur but not in DADA2, though that is a setting you can change. Whether to trust those rare taxa as "real" is a whole other discussion. In that comparison paper they use paired-end reads, which plays to the strength of DADA2 more than Deblur. Deblur gets more conservative and discards more reads as the reads become longer, so I'm not surprised it retained fewer reads, which also means less likelihood of retaining rare ASVs. In your case, however, you are working with single-end reads, which means the gap between DADA2 and Deblur in retaining rare taxa will become much narrower.
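To make the effect of those two thresholds concrete, here is a toy sketch with made-up counts. It is only an illustration of the cutoffs: the real tools apply them inside their own pipelines (DADA2 handles singletons during inference, not as a filter on the finished table), and filter_by_total is a hypothetical helper.

```python
def filter_by_total(table, min_total):
    """Keep features whose summed count across all samples is >= min_total."""
    return {
        asv: counts
        for asv, counts in table.items()
        if sum(counts.values()) >= min_total
    }

# Made-up feature table: ASV -> per-sample counts.
toy_table = {
    "ASV_common": {"s1": 120, "s2": 95},
    "ASV_rare": {"s1": 3, "s2": 4},       # 7 reads in total
    "ASV_singleton": {"s1": 1, "s2": 0},  # 1 read in total
}

deblur_like = filter_by_total(toy_table, 10)  # Deblur default: at least 10 reads
dada2_like = filter_by_total(toy_table, 2)    # singleton removal only

print(sorted(deblur_like))  # the rare ASV is dropped here
print(sorted(dada2_like))   # the rare ASV survives here
```

With these counts the 7-read ASV is kept under the singleton-only cutoff but lost under the 10-read cutoff, which is the kind of difference that can show up as "more rare ASVs" in a comparison.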

I would truncate before this point, say somewhere between 220 and 240. It's not an exact science (yet, anyway), so you have to play around with these values and weigh them against your goals. If you truncate at 220 you would probably retain a few more reads than at 240, but at 240 you may get a tiny bit more resolution because the reads are slightly longer. The thing is, the difference between these would be so minimal I doubt you would even notice it. So I personally prefer taking the conservative approach and trimming more of the low-quality tail to avoid false positives.
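If you want a starting point that is less eyeballed, one rough heuristic is to truncate where a sliding-window mean of the per-position median quality first drops below a cutoff. The sketch below assumes you have those medians as a list; the quality values, the cutoff of 25, and the window of 5 are all made up for illustration and are not taken from your plot or from DADA2 itself.

```python
from statistics import mean

def suggest_trunc_len(median_q, min_q=25, window=5):
    """Return how many leading bases to keep: the start index of the first
    window whose mean quality is below min_q, or the full length if none is."""
    for start in range(len(median_q) - window + 1):
        if mean(median_q[start:start + window]) < min_q:
            return start
    return len(median_q)

# Made-up per-position median qualities for a 260-base read:
# high quality, a gradual decline, a dip, then a small recovery at the end.
toy_quality = [36] * 230 + [30] * 10 + [22] * 10 + [15] * 5 + [28] * 5

trunc_len = suggest_trunc_len(toy_quality)
print(trunc_len)
```

Because the scan stops at the first low-quality window, a late recovery in quality after the dip does not pull the suggestion back out. Keep in mind that reads shorter than the chosen truncation length are discarded by DADA2, so a longer value can cost reads.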

ptalebic · May 4, 2020, 12:19pm

Thank you for your amazing explanation.
