DADA2 - losses during the denoising step

Hey everyone,

Currently, I am analysing a data set of 30 V1-V3 samples of 35000 reads each (Illumina), using a conda-installed qiime2-2020.11. The problem I am struggling with is that an unusually large number of reads is lost during the denoising step of dada2 denoise-single.

Using --p-trunc-len 451 and the default --p-max-ee 2, 79 - 85 % of the reads pass filtering (median 81.7 %). This drops to only 44 - 68 % after denoising (median 52.7 %) and 27 - 52 % after chimera removal (median 34.2 %). Out of curiosity, I tried raising --p-max-ee to 3, 5, and 10, which raised the median after denoising to 62.3 %, but the gap relative to the reads passing filtering in that case (median 98.37 %) is still huge.
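
For reference, the denoising command looks roughly like this (the artifact file names are just placeholders):

# qiime2-2020.11; only --p-max-ee was varied (2, 3, 5, 10) between runs
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux-trimmed.qza \
  --p-trim-left 0 \
  --p-trunc-len 451 \
  --p-max-ee 2 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza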

I am using denoise-single because the data set is composed of pre-merged reads, with quality information for the whole merged read. For what it’s worth, I did not notice decreases of this magnitude during denoising when analysing similar data sets of pre-merged reads, so I would not immediately call this the culprit. As these data sets were obtained by other people in the past and I am almost certain that the forward and reverse reads are not available anymore, I am working with what I’ve got at the moment.

Is there anything that I can do to find out what causes such a large decrease in reads during the denoising step, so that I can appropriately deal with it or know to accept the losses? If necessary, I can of course provide other pieces of information.

I would really appreciate any pointers, guidance, or insight into this matter!

Thanks,
Marko

Hi @mverce,

If you could share the .qzv file for the demultiplexed sequences, it would be really useful for assessing the problem.
Other than that, just a couple of observations. First, for pre-merged sequences DADA2 is not the right tool, because it relies on the original quality scores, which were probably changed during the merging step. So I would suggest looking at Deblur, which is not based on the quality scores and is therefore better suited to your case!
Second, I would test different truncation lengths instead of changing the --p-max-ee setting, especially with artificial quality scores as in your case.

Hope it helps
Luca

Thank you @llenzi for your response! I have attached the .qzv file of the demultiplexed sequences. The quality scores are increased in the area of overlap, which makes sense, since those base pairs are "sequenced twice", so to speak, no?

The truncation length was not an issue, since the reads were pre-merged, and the vast majority of reads were longer than the chosen truncation length.

I followed your suggestion and tried using Deblur, but the losses were even higher. As per one of the tutorials, I first did some quality filtering (min. quality 15), which resulted in an acceptable drop from 35000 to 30000 - 32500 reads per sample. After Deblur, however, at most 6000 reads per sample remained (median ca. 2200).
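
For reference, the two steps look roughly like this (file names are placeholders; the trim length is only an example):

# initial quality filtering (min. quality 15)
qiime quality-filter q-score \
  --i-demux demux-trimmed.qza \
  --p-min-quality 15 \
  --o-filtered-sequences demux-filtered.qza \
  --o-filter-stats demux-filter-stats.qza

# Deblur on the filtered reads
qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-filtered.qza \
  --p-trim-length 451 \
  --p-sample-stats \
  --o-representative-sequences rep-seqs-deblur.qza \
  --o-table table-deblur.qza \
  --o-stats deblur-stats.qza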

The strictness of Deblur in treating singletons made me look at the percentage of singletons in the data sets after adapter/primer trimming and truncation to the length I used for denoising. And indeed, this data set has the highest percentage of singleton sequences (ca. 85 %) among the data sets at my disposal. So I think that the high percentage of singletons is the cause of the losses during denoising, whether I use DADA2 or Deblur.
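
In case it helps anyone checking the same thing, one way to estimate the singleton percentage is to dereplicate the trimmed and truncated reads with vsearch and count the clusters of size 1 (just a sketch; file names are placeholders):

# dereplicate and record the abundance of each unique sequence in the header
vsearch --derep_fulllength reads-trimmed-truncated.fasta \
  --sizeout \
  --output derep.fasta

# count the singletons and their share among the unique sequences
awk -F'size=' '/^>/ {n++; if ($2 + 0 == 1) s++} END {printf "%d singletons / %d unique sequences (%.1f %%)\n", s, n, 100 * s / n}' derep.fasta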

Any advice about what to do with the singletons or where they came from (artificial, biological ...) is very welcome! But I realise that this shifts the topic from the initial question and that there are many discussions about it on this forum that I have to check out.

Best regards,
Marko

demux-trimmed.qzv (298.6 KB)

Hi @mverce,

nice investigation!

The quality scores are increased in the area of overlap, which makes sense, since those base pairs are “sequenced twice”, so to speak, no?

Yes, you are right on that! However, your plot looks a bit odd to me. After the overlapping region there is a drop in quality, which may be problematic. It looks like, before the merging, they quality-trimmed the forward sequences but not the reverse sequences.
I wonder if these low-quality bases are confusing the denoisers.
A test you could do is to use Deblur and trim so that only the forward reads are included (at about 250 bp), to see whether the number of singletons is lower.

Best
Luca

Hi all!
Currently I am struggling with a dataset in which the quality drops in the middle of both the forward and reverse reads. I suspected that the reason was the targeted region (ITS), but I just tested another dataset that targets the same region, and I do not have such problems with that one.
As with @mverce's data, both DADA2 and Deblur filter out a lot of sequences at the denoising step due to the large number of singletons, which, I suspect, originate from the low-quality regions in my reads. The only way I can retain a larger number of reads is to perform OTU picking with VSEARCH instead of denoising the reads to ASVs.
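
In QIIME 2, that route looks roughly like this (just a sketch; file names are placeholders and 97 % identity is only an example threshold):

qiime vsearch join-pairs \
  --i-demultiplexed-seqs pe-demux.qza \
  --o-joined-sequences demux-joined.qza

# (a quality-filter q-score step on the joined reads usually goes here; I skip it in this sketch)

qiime vsearch dereplicate-sequences \
  --i-sequences demux-joined.qza \
  --o-dereplicated-table table-derep.qza \
  --o-dereplicated-sequences rep-seqs-derep.qza

qiime vsearch cluster-features-de-novo \
  --i-table table-derep.qza \
  --i-sequences rep-seqs-derep.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table table-otu97.qza \
  --o-clustered-sequences rep-seqs-otu97.qza
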
So I am tracking this topic to see whether you come up with a better solution.
Best
Timur

Hi all,

Some follow-up: in the meantime, I was able to obtain the forward and reverse reads for this data set! However, I then stumbled on the more common issue of large losses during the merging step when using DADA2 on this data set (V1-V3; pe-demux-trimmed.qzv with read qualities attached).

pe-demux-trimmed.qzv (318.7 KB)

I tried improving the yield of final non-chimeric reads by modifying the max-ee and trunc-len parameters. One of the combinations I tried looked roughly like this (the artifact names are placeholders):
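
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs pe-demux-trimmed.qza \
  --p-trunc-len-f 280 \
  --p-trunc-len-r 232 \
  --p-max-ee-f 2 \
  --p-max-ee-r 5 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza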

Modifying these parameters increased the percentage of reads retained after the whole process, but only up to a point (20-46 % non-chimeric reads with the extreme setting of max-ee 10 for both R1 and R2, median ca. 26.7 %), and at the cost of decreasing the reliability of the sequences due to the max-ee increases, as well as producing some extremely long, very low-abundance ASVs, depending on the parameters. This leads me to the following questions:

How high a max-ee is still considered “ok”? Increasing max-ee to 5 per read still seems somewhat acceptable to me, but is that intuition wrong here?

Given the low yields using reasonable settings (e.g., max-ee-f = 2, max-ee-r = 5, trunc-len-f = 280, trunc-len-r = 232), would it be better to just use the forward reads? I have read on the forum about people using only forward reads, but how does one report that in eventual publications, and how legitimate/acceptable is it?

Are there other strategies better suited to dealing with the low proportion of merged reads due to the amplicon length when using DADA2? Merging with some other tool, or using VSEARCH as @timanix mentioned?

Best regards,
Marko

Hi @mverce,

thanks for the update! I suspect that your truncation lengths do not allow the reads to merge; you don't have enough overlap left after the quality trimming (at least 12 bases of overlap are required).
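
As a rough back-of-the-envelope check (the exact V1-V3 amplicon length varies, so the lengths below are only an illustration):

overlap = trunc-len-f + trunc-len-r - amplicon length
e.g.  280 + 232 - 490 = 22 bases  ->  enough overlap to merge (>= 12 needed)
      280 + 232 - 505 =  7 bases  ->  too little, so the pair is discarded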

To me, moving on with the forward reads only is really the better way: you may lose a bit of resolution in lower-level taxonomic classification, but you do not introduce any bias into your data due to unequal merging. I cannot find it at the moment, but I am sure I saw a reference for using forward reads only somewhere on this forum; if I find it again, I will report it here.

For completeness, a possible way to merge could be to use dada2 in R, which has an option to insert Ns between the reads as gap filling (the justConcatenate option of mergePairs, as far as I know). I have never used it, so this is only as far as I know, and I am not sure how this method performs in taxonomic classification.

Hope it helps
Luca

Thank you for your input @llenzi! It looks like balancing the qualities and truncation lengths can only get me so far under these circumstances, so for now I think I'll move on with the forward reads, as you said.

All the best,
Marko
