Lost of data with dada2

blau · October 9, 2017, 4:42pm

Hi all,

I am repeatedly losing the majority of the reads in my samples after running dada2. To understand in which step the data is lost, I run dada2 externally and found that about half of the reads are removed after denoising, and almost all of the remaining reads failed to merge. Then, to verify that the problem was not with the samples I used, I run a mock community (mock18, taken from mockrobiota), and again, based on the sequence counts (files below) only 27% (!) of the reads remained after running dada2.

The command I used:

qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza
--p-trunc-len-f 220 --p-trunc-len-r 220 --p-n-threads 0
--o-representative-sequences rep-seqs.qza
--o-table table.qza

table.qza (8.4 KB)
demux.qzv (262.8 KB)
table.qzv (279.4 KB)

To my understanding, the prime reason for merging problem is an insufficient overlapping region between the reads. However, trying different truncation values didn't help and most of the reads were removed.
Also, trying to change with the maxEE value indeed reduced the amount of the removed reads, however the merging problem remained.

Links to f+r reads:
https://s3-us-west-2.amazonaws.com/mockrobiota/latest/mock-13/mock-forward-read.fastq.gz
https://s3-us-west-2.amazonaws.com/mockrobiota/latest/mock-13/mock-reverse-read.fastq.gz

Any help would be appreciated.
Thanks in advance,
Dor

jairideout · October 10, 2017, 8:03pm

Hi @blau! From the demux.qzv you posted for mock-13, it looks like the reverse reads are pretty poor quality across the length of the sequences. My guess is that many of the reverse reads are being discarded during the denoising step, and there's not much left over to merge.

You might try the following:

Process only the mock-13 forward reads and see how many sequences are discarded. @Nicholas_Bokulich has analyzed mock-13 forward-reads only with DADA2 and obtained reasonable results.
The first several positions in the mock-13 sequences are also low quality. You could try trimming off the first several positions from the forward and reverse reads using --p-trim-left-f and --p-trim-left-r. For example, you might try trimming the first 12 positions of the forward reads, and the first 5 positions of the reverse reads.
I think that sequencing artifacts have already been removed from the mock-13 reads. When you're processing your own data, make sure that all sequencing artifacts (e.g. adapters, primers, barcodes) have been removed from the sequences before processing with DADA2.
The DADA2 functionality available in qiime2 assumes you have amplicon data and that it's produced by an Illumina platform (i.e. for the error model DADA2 builds). mock-13 fits these requirements but you'll want to verify that with your own data set.
If you have ITS data, you may need to trim off the reverse primers from the forward reads if they're present (in addition to the other sequencing artifacts mentioned above).

Let us know how it goes! Sorry to not have a more clear/straightforward answer for you

Thanks @ebolyen and @Nicholas_Bokulich for these suggestions!

blau · October 11, 2017, 4:21pm

Hi @jairideout,

Thanks for your reply.
I uploaded mock-13 as an example, but working with many other mocks resulted with an unreasonable amount of reads removed. It’s important to mention that I'm certain that all sequencing artifacts have been
removed and the dataset is indeed amplicon data, targeting the 16S gene.

I was wondering, how sensitive is dada2 to low quality datasets?
Comparison of the datasets quality plots from the tutorials (Moving pictures and FMT) to the mock communities’ quality plots leaves no doubt that the quality of the reads in the latest datasets is poorer.
Also, I run dada2 on a small, higher quality dataset. Again, it seems to perform better, removing an acceptable amount of reads (This makes sense of course, but should the difference be so big? the quality of the reads of the mocks is not perfect, but not very bad either).

Bottom line, is dada2 quality filtering less permissive than other quality filtering algorithms (such as the one used in Qiime1)? Would more reads be removed by using dada2 than qiime1, given an identical dataset and similar parameters?

Thank you again.
Best,
Dor

Nicholas_Bokulich · October 11, 2017, 4:57pm

Hi @blau,
I have encountered similar problems myself with some datasets. Unfortunately, sometimes the read quality (on reverse reads in particular) can be too low to permit merging forward/reverse reads. Currently available Illumina read lengths are typically long enough to cover some common 16S rRNA gene domains fully, e.g., V4 domain is fully covered by 300 nt reads, so merging may not always be necessary.

Some of the mockrobiota data sets are a few years old and the read quality in some runs is not great as you have noticed; I have always just used the forward reads so cannot offer much insight on using paired-end reads with these data.

That is going to depend on the quality of the data sets and where you do the trimming. If large sections of the read are erroneous, it may be better to set a lower --p-trunc-len-r; if the read quality drops off below the point at which reads overlap, however, you may be out of luck and these reads cannot be merged. @benjjneb may be able to comment on the issue of dada2 sensitivity more.

Absolutely (and results suggest that is a good thing). Qiime1 did not have a method for denoising Illumina data. The quality filtering methods used for Illumina data relied mostly on trimming and filtering reads based on PHRED scores and length. So dada2 will discard many more reads than qiime1 did because it offers functionality not available in qiime1. In that respect, similar parameters is beside the point. The dada2 paper illustrates that qiime1 with default parameters is much more permissive, but whether this would change with, e.g., more stringent quality filtering parameters (as described here) and chimera filtering methods, is untested to my knowledge.

I hope that answers your questions — please let me know if you have any more questions or concerns.

benjjneb · October 11, 2017, 8:04pm

Hi @blau

Two things. First, removing a large fraction of reads may not be a bad thing. Remember that the absolute counts don't have meaning, this data is compositional hence only the relative abundances matter. Lower depth means more uncertainty in frequency estimates, but not that much more, and it is well-worth filtering out half (or more) of your data if that removes most of the errors!

Second, you do have something odd going on in this dataset that is leading to almost half the data failing to merge: the data seems to contain a mix of amplicon sizes. Around 60% of the amplicons are V4 (~253nts) while the rest seem to be V4V5 (same forward start, different reverse start, ~253+123=~376nts), and the merge step is generally losing one or the other. Might this data have been generated using a mixture of the 805R/926R reverse primers?

benjjneb · October 11, 2017, 8:15pm

An aside, we/I should update the q2-dada2 plugin to report the fraction of reads making it through each step. That would make it easier to diagnose where issues like this arise within the multi-step pipeline the plugin implements.

Nicholas_Bokulich · October 11, 2017, 8:19pm

@benjjneb thank you for your input, and thank you for reminding of something that I should have mentioned earlier. The mock-13, 14, and 15 datasets are bizarre datasets — they contains a mixture of 3 different amplicons (check out the original paper for more details). It sounds like @blau is having this issue with other datasets as well, though.

sounds like a useful output.

benjjneb · October 11, 2017, 8:55pm

That is strange!

On other datasets, the number of reads passing each step is a very useful way to diagnose the issue. My basic troubleshooting workflow would go:

Problem -- low fraction of reads making it through pipeline.

Is it really a problem? (see above)
What step are most being lost in?
If filtering: Try truncating sooner, especially before quality crashes, or raise maxEE.
If merging: Make sure your reads still overlap (by 20nt + amplicon-length-variation) after truncation.
If chimera removal: You probably need to remove primers from the reads.
If still not working -- is this just really bad quality data? Especially the reverse reads? Try just forward reads?

That's not going to cover every possible way things can go wrong (you don't usually expect a mix of V3V4, V4 and V4V5 amplicons!) but in my experience that will usually get to the bottom of "lots of reads being lost".

jairideout · October 11, 2017, 9:54pm

that'd be great! Linking to the issue for tracking purposes, we'll follow up here when it's available in a qiime2 release!

blau · October 12, 2017, 8:03am

Thank you all for the input, it's very useful.
Adding information regarding the removed reads after each step would be handy, and will definitely help troubleshooting such issues.

Best,
Dor

blau · October 17, 2017, 10:50am

Hello,
Before this topic will be closed, I thought to get your advise regarding dada2 quality control.
To try to understand where I lose my reads, I run dada2 externally. In the following table you can see the number of reads that are left after each step.

Again, the majority of the reads are removed, this time at the step of chimera removal.
The command I used for chimera removal is:

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)

The primers have been removed from the reads. What could be the reason for losing ~80% of the reads in each sample at this step?

Best,
Dor

benjjneb · October 17, 2017, 9:50pm

Hm... could you share the raw sequence data that produced that table with me?

I don't think I've ever seen that kind of extreme drop-off If there really weren't still primers (or other ambiguous nucleotides) on the reads, so am very curious as to what's going on!

blau · October 18, 2017, 7:07am

OOps, my bad. You are right @benjjneb, the primers were indeed hidden there, somehow I missed it. sorry.

ebolyen · December 22, 2017, 5:53pm

The DADA2 plugin in QIIME 2 2017.12 now prints how many reads were filtered, denoised, merged, and non-chimeric to stdout! (You can capture that via the --verbose flag.)