Denoising: how does it work? DADA2 is throwing out some samples

I’ve got around 80 samples and about 70% of them analyse OK. I’m running MiSeq PE250 (Nano kit) and have trimmed 26 bp at each end and truncated to 220 on R1 and 170 on R2. If I tweak the truncation the odd extra one goes through. There is a fairly common pattern: the failing samples have fewer reads. This is not 16S but Co1, and the original target is around 313 bp. The message from DADA2 is ‘No features remain after denoising. Try adjusting your truncation and trim parameter settings’. In most cases all reads from each sample are a single (identical) amplicon of the same species… is it struggling because it’s expecting biological differences???

Hi @bakerd,
Thanks for posting!

I think there are a few possibilities here, ranked in order of probability:

  1. Lack of sequence diversity. I believe this might be the issue, but I cannot find where this was previously discussed on the forum — @benjjneb might be able to clarify this for us.

  2. Chimeric reads. I do not believe dada2 is marker-gene specific, but if Co1 is a funny marker (e.g., lots of repeating sections), some reads might look chimeric to dada2. Also, if Co1 has a high degree of length heterogeneity, it is possible that some reads pass through the reverse primer and into the adapter. This non-biological sequence will look chimeric to dada2 and cause the read to be thrown out.

  3. Low read counts. How many reads do these samples have? If they have very low read counts, that is either consistent with the point above about read heterogeneity (fewer reads -> fewer true variants), or these may just be bad samples altogether (few reads because, e.g., the PCRs failed and you are left with largely noise/contaminant DNA that should be removed).
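As a quick way to eyeball read counts per sample outside of QIIME, here is a minimal sketch (the directory layout and file names are hypothetical; the demo data is made up just so the loop is runnable — point the glob at your own demultiplexed fastq.gz files):

```shell
# Demo setup: a tiny 3-read stand-in for a real demultiplexed sample
mkdir -p demo_reads
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nACGT\n+\nIIII\n' \
  | gzip > demo_reads/sample1_R1.fastq.gz

# A FASTQ record is 4 lines, so reads = lines / 4
for f in demo_reads/*_R1.fastq.gz; do
  n=$(( $(gzip -dc "$f" | wc -l) / 4 ))
  echo "$f: $n reads"
done
```

The same per-sample counts are also shown in the `demux summarize` visualization, but a loop like this is handy for quickly flagging the low-count samples on the command line.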

If none of these suggestions gets us any closer to a solution, could you please also:

  1. provide screenshots of your average read quality (from demux summarize)?
  2. run dada2 with the --verbose flag to print how many reads were filtered, denoised, merged, and non-chimeric.

Thanks!

Thanks. I’m leaning towards the first suggestion. Quality plots look fine for both reads in the demux visualization. In some cases samples are straight replicates where one fails and one is OK, with the same read counts. I do think a lower read count increases the chance of failure, though. I’ll run with --verbose and have a look… Thanks for the quick response!

Second this, that diagnostic info would be very useful.

My first guess is that in the low abundance samples, some are being filtered to zero (or near zero), but the verbose output would help a lot to verify that idea.
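Once you have the verbose log, a quick way to flag samples filtered to (near) zero is to parse the "Sample N - X reads in Y unique sequences" lines. A sketch, with made-up log contents (the 50-read threshold is arbitrary; tune it to your data):

```shell
# Made-up snippet mimicking dada2's verbose per-sample lines
cat > demo_dada2.log <<'EOF'
Sample 1 - 422 reads in 118 unique sequences.
Sample 2 - 15 reads in 15 unique sequences.
Sample 3 - 0 reads in 0 unique sequences.
EOF

# Print samples whose read count (field 4) falls below the threshold
awk '/reads in/ && $4 + 0 < 50 { print $1, $2, "only has", $4, "reads" }' demo_dada2.log
```

Any sample that shows up here is a candidate for the "filtered to zero" explanation.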

Thanks… sorry, as I’m sort of a lab guy, where can I put the --verbose…?

parallel qiime dada2 denoise-paired \
--i-demultiplexed-seqs {}dir/{}demux-paired-end.qza \
--o-table {}dir/{}table \
--o-representative-sequences {}dir/{}rep-seqs
--p-trim-left-f 26 \
--p-trim-left-r 26 \
--p-trunc-len-f 220 \
--p-trunc-len-r 170 :::: filenames.txt --verbose

It errors…

I think you should be able to add that flag anywhere, but

  1. looks like you are missing a backslash at the end of one line
  2. maybe the bash tricks you’ve thrown in there are causing a problem.

Try

parallel qiime dada2 denoise-paired --verbose \
--i-demultiplexed-seqs {}dir/{}demux-paired-end.qza \
--o-table {}dir/{}table \
--o-representative-sequences {}dir/{}rep-seqs \
--p-trim-left-f 26 \
--p-trim-left-r 26 \
--p-trunc-len-f 220 \
--p-trunc-len-r 170 :::: filenames.txt

Thanks, that worked… Most say something like this:

2a) Forward Reads
Initializing error rates to maximum possible estimate.
Sample 1 - 422 reads in 118 unique sequences.
selfConsist step 2

Convergence after 2 rounds.
2b) Reverse Reads
Initializing error rates to maximum possible estimate.
Sample 1 - 422 reads in 124 unique sequences.
selfConsist step 2

Convergence after 2 rounds.

3) Denoise remaining samples
4) Remove chimeras (method = consensus)
5) Write output

ValueError: No features remain after denoising. Try adjusting your truncation and trim parameter settings.

Plugin error from dada2:

No features remain after denoising. Try adjusting your truncation and trim parameter settings.

No other stats…

Thanks @bakerd

How does that relate to the number of input sequences?

It looks like (at least in the output you have shown for Sample 1) there is a reasonably large number of unique sequences, which would rule out the theory about lack of sequence diversity.

However, dada2 requires a certain level of replication before a read is considered a true sequence. If some samples have very low read counts, the issue may be too many unique sequences relative to total reads: if all reads are singletons, they will all be removed.
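A toy illustration of the singleton point (the sequences below are made up): at low read counts, a large share of unique sequences occur only once, and counting them is a one-liner.

```shell
# Made-up reads: one sequence seen 3 times, two seen once each
printf 'ACGTACGT\nACGTACGT\nACGTACGT\nTTTTAAAA\nGGGGCCCC\n' > demo_seqs.txt

# Singletons = distinct sequences that occur exactly once
singletons=$(sort demo_seqs.txt | uniq -c | awk '$1 == 1' | wc -l)
total=$(sort -u demo_seqs.txt | wc -l)
echo "$singletons of $total unique sequences are singletons"
```

If nearly every unique sequence in a sample is a singleton, denoising can legitimately remove everything, which matches the error message being seen here.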

Could you share your demux summarize results? That will give us reads per sample, and seeing the quality plots might help too. @benjjneb any thoughts?

From that output my guess is that it’s a merging issue, as the denoising and chimera-removal steps can never remove all the sequences.
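A rough back-of-envelope check of the merge overlap with the settings in this thread (this assumes the ~313 bp target length includes both 26 nt primers; adjust the arithmetic if it does not):

```shell
# Settings from the thread; ~12 bp of overlap is needed for merging
amplicon=313; trim=26; trunc_f=220; trunc_r=170

kept_f=$(( trunc_f - trim ))       # forward bases kept after trimming
kept_r=$(( trunc_r - trim ))       # reverse bases kept after trimming
insert=$(( amplicon - 2 * trim ))  # biological insert between the primers

overlap=$(( kept_f + kept_r - insert ))
echo "expected overlap: ${overlap} bp"
```

Under those assumptions the expected overlap is comfortably above the minimum, so a merging failure would point at something unusual about these particular samples (e.g., off-target amplicon lengths) rather than the truncation settings themselves.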

Can you share this small sample? Is this Co1 sample amplified using the same primers, and from the same sort of sample, as the rest that worked?

Yes, same primers and same type of samples. Rest have worked.

Thanks for all your help, guys. I will have a think over the weekend. Agree that there is some diversity there. Could it be the odd N in the sequences? I don’t think I have any.

OK, so reducing the trim at each end to 13 instead of 26 solves the problem and all samples are processed, but I’m now only recovering 25% of the reads in each. I chose 26 because my Co1-specific primer is 26 bp long and I want to remove it from the analysis, as it contains degenerate/wobble bases and could be slightly incorrect.

I’ve tested different trim lengths and the sweet spot is 23, 3 bases short of removing the entire primer sequence. I’m now seeing all the data and keeping 92% of reads. Thanks @Nicholas_Bokulich and @benjjneb for all your help.

Hi @bakerd,
Glad to hear you got it working! It sounds a little strange that 23 works but 26 doesn’t… but I don’t see why trimming at 23 would be a problem. Just a couple of thoughts to add:

Ideally the primer (or at least all wobble bases) should be removed. Degenerate bases will inflate the apparent diversity, which in some cases could lead to rare (but valid) sequence variants being removed as error. It sounds like you are probably safe trimming at 23 (degenerate bases in the last 3 positions would seem a bit counter-intuitive!)
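For what it’s worth, here is a toy sketch of checking how many reads still begin with a degenerate primer, by converting IUPAC codes to a character class (the primer and reads below are made up, not the actual Co1 primer; extend the `sed` substitutions for other IUPAC codes as needed):

```shell
# Made-up degenerate primer; Y = C or T in IUPAC notation
primer='GGYCC'
regex=$(echo "$primer" | sed 's/Y/[CT]/g')

# Made-up reads: the first two start with the primer, the third does not
printf 'GGCCCAAAA\nGGTCCTTTT\nAAAAGGCC\n' > demo_primer_reads.txt

# Count reads that begin with the (degenerate) primer
grep -c "^${regex}" demo_primer_reads.txt
```

A check like this can confirm whether trimming at a given length actually removes the primer from most reads, independent of what dada2 reports.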

That is not necessarily a bad thing (though 92% is certainly more appealing!). I agree, it is a lot of data loss, but we have frequently seen yields like this with dada2 for 16S rRNA gene sequence data. Considering that a sizable fraction is often chimera and reads with some sequence errors present, such a high level of filtering is not necessarily surprising or unwarranted. Just something to keep in mind for future data sets.


Thanks @Nicholas_Bokulich, what you say makes sense. Actually, the forward and reverse primers end in YCC and YCA, so trimming to 23 will leave a degenerate C/T at each end. Thanks again.


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.