I am repeatedly losing the majority of the reads in my samples after running dada2. To understand in which step the data is lost, I run dada2 externally and found that about half of the reads are removed after denoising, and almost all of the remaining reads failed to merge. Then, to verify that the problem was not with the samples I used, I run a mock community (mock18, taken from mockrobiota), and again, based on the sequence counts (files below) only 27% (!) of the reads remained after running dada2.
To my understanding, the prime reason for merging problem is an insufficient overlapping region between the reads. However, trying different truncation values didn't help and most of the reads were removed.
Also, trying to change with the maxEE value indeed reduced the amount of the removed reads, however the merging problem remained.
Hi @blau! From the demux.qzv you posted for mock-13, it looks like the reverse reads are pretty poor quality across the length of the sequences. My guess is that many of the reverse reads are being discarded during the denoising step, and there’s not much left over to merge.
You might try the following:
Process only the mock-13 forward reads and see how many sequences are discarded. @Nicholas_Bokulich has analyzed mock-13 forward-reads only with DADA2 and obtained reasonable results.
The first several positions in the mock-13 sequences are also low quality. You could try trimming off the first several positions from the forward and reverse reads using --p-trim-left-f and --p-trim-left-r. For example, you might try trimming the first 12 positions of the forward reads, and the first 5 positions of the reverse reads.
I think that sequencing artifacts have already been removed from the mock-13 reads. When you’re processing your own data, make sure that all sequencing artifacts (e.g. adapters, primers, barcodes) have been removed from the sequences before processing with DADA2.
The DADA2 functionality available in qiime2 assumes you have amplicon data and that it’s produced by an Illumina platform (i.e. for the error model DADA2 builds). mock-13 fits these requirements but you’ll want to verify that with your own data set.
If you have ITS data, you may need to trim off the reverse primers from the forward reads if they’re present (in addition to the other sequencing artifacts mentioned above).
Let us know how it goes! Sorry to not have a more clear/straightforward answer for you
Thanks for your reply.
I uploaded mock-13 as an example, but working with many other mocks resulted with an unreasonable amount of reads removed. It’s important to mention that I’m certain that all sequencing artifacts have been
removed and the dataset is indeed amplicon data, targeting the 16S gene.
I was wondering, how sensitive is dada2 to low quality datasets?
Comparison of the datasets quality plots from the tutorials (Moving pictures and FMT) to the mock communities’ quality plots leaves no doubt that the quality of the reads in the latest datasets is poorer.
Also, I run dada2 on a small, higher quality dataset. Again, it seems to perform better, removing an acceptable amount of reads (This makes sense of course, but should the difference be so big? the quality of the reads of the mocks is not perfect, but not very bad either).
Bottom line, is dada2 quality filtering less permissive than other quality filtering algorithms (such as the one used in Qiime1)? Would more reads be removed by using dada2 than qiime1, given an identical dataset and similar parameters?
I have encountered similar problems myself with some datasets. Unfortunately, sometimes the read quality (on reverse reads in particular) can be too low to permit merging forward/reverse reads. Currently available Illumina read lengths are typically long enough to cover some common 16S rRNA gene domains fully, e.g., V4 domain is fully covered by 300 nt reads, so merging may not always be necessary.
Some of the mockrobiota data sets are a few years old and the read quality in some runs is not great as you have noticed; I have always just used the forward reads so cannot offer much insight on using paired-end reads with these data.
That is going to depend on the quality of the data sets and where you do the trimming. If large sections of the read are erroneous, it may be better to set a lower --p-trunc-len-r; if the read quality drops off below the point at which reads overlap, however, you may be out of luck and these reads cannot be merged. @benjjneb may be able to comment on the issue of dada2 sensitivity more.
Absolutely (and results suggest that is a good thing). Qiime1 did not have a method for denoising Illumina data. The quality filtering methods used for Illumina data relied mostly on trimming and filtering reads based on PHRED scores and length. So dada2 will discard many more reads than qiime1 did because it offers functionality not available in qiime1. In that respect, similar parameters is beside the point. The dada2 paper illustrates that qiime1 with default parameters is much more permissive, but whether this would change with, e.g., more stringent quality filtering parameters (as described here) and chimera filtering methods, is untested to my knowledge.
I hope that answers your questions — please let me know if you have any more questions or concerns.
Two things. First, removing a large fraction of reads may not be a bad thing. Remember that the absolute counts don’t have meaning, this data is compositional hence only the relative abundances matter. Lower depth means more uncertainty in frequency estimates, but not that much more, and it is well-worth filtering out half (or more) of your data if that removes most of the errors!
Second, you do have something odd going on in this dataset that is leading to almost half the data failing to merge: the data seems to contain a mix of amplicon sizes. Around 60% of the amplicons are V4 (~253nts) while the rest seem to be V4V5 (same forward start, different reverse start, ~253+123=~376nts), and the merge step is generally losing one or the other. Might this data have been generated using a mixture of the 805R/926R reverse primers?
An aside, we/I should update the q2-dada2 plugin to report the fraction of reads making it through each step. That would make it easier to diagnose where issues like this arise within the multi-step pipeline the plugin implements.
@benjjneb thank you for your input, and thank you for reminding of something that I should have mentioned earlier. The mock-13, 14, and 15 datasets are bizarre datasets — they contains a mixture of 3 different amplicons (check out the original paper for more details). It sounds like @blau is having this issue with other datasets as well, though.
On other datasets, the number of reads passing each step is a very useful way to diagnose the issue. My basic troubleshooting workflow would go:
Problem – low fraction of reads making it through pipeline.
Is it really a problem? (see above)
What step are most being lost in?
If filtering: Try truncating sooner, especially before quality crashes, or raise maxEE.
If merging: Make sure your reads still overlap (by 20nt + amplicon-length-variation) after truncation.
If chimera removal: You probably need to remove primers from the reads.
If still not working – is this just really bad quality data? Especially the reverse reads? Try just forward reads?
That’s not going to cover every possible way things can go wrong (you don’t usually expect a mix of V3V4, V4 and V4V5 amplicons!) but in my experience that will usually get to the bottom of “lots of reads being lost”.
Before this topic will be closed, I thought to get your advise regarding dada2 quality control.
To try to understand where I lose my reads, I run dada2 externally. In the following table you can see the number of reads that are left after each step.
Again, the majority of the reads are removed, this time at the step of chimera removal.
The command I used for chimera removal is: