Conceptual justification of dada2 truncation prior to merging

I think it is a fundamentally questionable approach of dada2 to discard reads over one or two mismatches. We all know that:

  1. toward the end of the reads there is uncertainty, and
  2. one of the primary theoretical bases of sequencing is that reads together reinforce and correct each other.

Although dada2 figures out the rough amplicon size and could combine/join/merge the two sequences with its eyes closed, it ignores this valuable information (discarding the other 450 nucleotides of any number of read pairs, even all of them), in contrast with vsearch.
If dada2 were human, I would call this cynical and malignant behavior, but since it is just code, I call it a conceptual error.
1 Like

Hi @Peter,
I think dada2 and vsearch fundamentally differ enough that a comparison wouldn’t really be fair to either one.
I think dada2 doesn’t allow for mismatches because it assumes that by the time merging is occurring those nts are error-free, either by having inferred the correct sequences or by strictly discarding those that couldn’t be resolved. Working under that assumption, it probably makes sense to be strict and not allow mismatches. Vsearch doesn’t do any inferring, so of course it relies on other methods for quality control. I would agree that in this scenario sharing info within the overlapping region makes total sense.

Only within the overlapping sequences/regions though, that is what you meant, right?

Yes and no? I think with PE reads the emphasis is on making sure they can be confidently joined; otherwise, joining PEs with a low-quality overlap, or without one entirely, can lead to many spurious reads.

:rofl:

2 Likes

Yeah, @Mehrbod_Estaki is totally correct: Dada2 does error correction first, then merges the corrected reads, so it can’t be an apples to apples comparison. :apple: :pineapple:


However…

and malignant

:scream_cat: :cancer: :crab:

Jokes aside, I think it’s worth discussing whether this is a ‘reasonable default.’ Defaults become norms, so we’ve got to pick good ones.

Colin

In the future, this will be a setting you can change directly in Qiime 2!
(We are working on it over here!)

I agree. And dada2 may do a really great job with its error model at predicting which reads to discard and which to keep, so that only the perfect ones are subjected to joining.

But I think that dada2 should not do that. If it can find out which nts are incorrect or questionable, then it should flag them: the nucleotides, not the reads. Just flag/score them, but do not discard the reads.

Since the invention of quality scores and fastq files, we have not wanted to discard whole sequences just because some nucleotides along their length are less than 100.00% certain, so dada2 should not do that either. It is a great, great waste, even in one-directional sequencing. In that case the uncertain nts should be flagged, changed to Ns, or treated accordingly.
But in paired-end sequencing (at joining) we have a better source of information for error correction. Theoretical calculations, with whatever magic, statistics, or error models (based upon other [100000] reads), are necessarily inferior to alignment against the opposite-direction sequencing of the very same piece of DNA (or to a combination of the two approaches).
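The idea of letting the two reads correct each other in the overlap can be sketched like this. This is a toy illustration of the concept only, not dada2’s or vsearch’s actual code: at each overlapping position where the reads disagree, the base with the higher Phred quality wins (function name and example data are mine):

```python
# Toy sketch of paired-end consensus merging: join a forward read with the
# already reverse-complemented reverse read, resolving overlap mismatches
# by quality score. Illustrative only; not the dada2/vsearch implementation.

def merge_overlap(fwd, fwd_qual, rev_rc, rev_qual, overlap_len):
    """Join two reads whose last/first `overlap_len` bases overlap.

    At each overlapping position, keep the base with the higher
    Phred quality score.
    """
    fwd_tail = fwd[-overlap_len:]
    rev_head = rev_rc[:overlap_len]
    merged_overlap = []
    for i in range(overlap_len):
        fb, fq = fwd_tail[i], fwd_qual[len(fwd) - overlap_len + i]
        rb, rq = rev_head[i], rev_qual[i]
        # agreement keeps the base; disagreement trusts the higher quality
        merged_overlap.append(fb if fq >= rq else rb)
    return fwd[:-overlap_len] + "".join(merged_overlap) + rev_rc[overlap_len:]

# The forward read ends with a low-quality error (G where the reverse read,
# at high quality, says T): the overlap corrects it instead of discarding.
fwd      = "ACGTACGG"
fwd_qual = [38, 38, 38, 38, 35, 30, 20, 5]
rev_rc   = "ACGTTTTT"          # reverse read, already reverse-complemented
rev_qual = [30, 32, 35, 38, 38, 38, 38, 38]

print(merge_overlap(fwd, fwd_qual, rev_rc, rev_qual, overlap_len=4))
# -> ACGTACGTTTTT
```

This is essentially the argument above: the opposite-direction read of the same molecule carries direct evidence about the uncertain 3’ bases, so a high-quality base on one strand can rescue a low-quality mismatch on the other instead of the whole pair being thrown away.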

Hey Peter,

Yeah, I don’t really like the dada2 defaults either, and it’s frustrating that some of them can’t be changed within the q2-plugin right now. The good news is that more of these options are being added to the plugin, and all of them can be changed right now when running dada2 directly in R.

For 16S V4 on the Illumina MiSeq using 150 bp or 250 bp reads, I’ve found that the quality is quite high for most reads, and this is a moderate waste :man_shrugging:

For variable length, say ITS with 300 bp overlap, this can be a much bigger deal :scream_cat:

Have you tried using dada2 in R?

Colin

@Peter_Kos,
If you disagree with the methodological design of dada2, I recommend using one of the other denoising/OTU clustering methods available in QIIME 2: deblur or vsearch, respectively. These all perform quite well, and QIIME 2 is designed to give users flexibility for their own needs! I use all of these methods regularly for different purposes.

However, I disagree with your point here:

This simply is not supported by the evidence in various published benchmarks: see the original publications for dada2 and deblur, and also at least one independent benchmark indicating that dada2 is at least as good as, if not better than, methods that join prior to denoising or OTU clustering. If you have evidence to support this, please share it and it will contribute to this discussion.

You can adjust the --p-max-ee parameter to set the threshold for removing sequences based on their error profiles. Moreover, the goal of truncation is to trim away the low-quality sections at the 3’ ends of your reads so that, ideally, the whole read is not tossed out.
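For reference, the expected-errors filter behind --p-max-ee sums the per-base error probabilities implied by the Phred scores, EE = Σ 10^(−Q/10), and keeps a read only if EE is at or below the threshold. A minimal sketch of that published formula (the function name and example quality profile are mine):

```python
def expected_errors(quals):
    """Sum of per-base error probabilities implied by Phred scores:
    EE = sum(10 ** (-Q / 10)). This is the quantity that dada2's maxEE
    (q2-dada2's --p-max-ee, default 2.0) thresholds reads on."""
    return sum(10 ** (-q / 10) for q in quals)

# A read whose 3' end has decayed: the low-quality tail dominates EE,
# which is why truncating before filtering rescues otherwise-good reads.
quals = [38] * 200 + [10] * 50

print(round(expected_errors(quals), 2))        # -> 5.03 (fails maxEE=2)
print(round(expected_errors(quals[:200]), 2))  # -> 0.03 (passes after truncation)
```

This is also why the truncation and filtering parameters interact: the same read can fail or pass the EE filter depending entirely on whether its noisy tail was trimmed first.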

I realize this clashes with your statement that this is wasteful when read overlapping could correct these errors and salvage the sequence. But that is not an error-proof approach either, as mismatched joins do occur.

dada2 is a somewhat conservative method and as far as I can tell has been optimized for precision, i.e., reduce false positives; it does so at the expense of throwing out some “not as bad” reads along with the unsalvageable. Is that such a bad thing? Is losing, say, 25% of your sequences in an unbiased fashion not preferable if it means that you obtain better diversity estimates? (better as in backed up by independent benchmark data)

@Peter_Kos unfortunately this will not help you, either, as you take issue with how the reads are trimmed and merged, which is a characteristic of dada2 in general and not specific to the q2-dada2 plugin — running directly in R does give more flexibility with different options but merging prior to denoising is still not an option, as it muddles the quality score profile.

4 Likes

Thanks Colin,
I have not yet had the chance, as first I want to test other denoising and error filtering/correction possibilities in qiime. I am amazed by the utility of the provenance info in the more-and-more complex data structures toward the end of the pipeline, so as a first try of the unlimited possibilities I wanted to stay within the qiime2 framework.
Nevertheless, I must test dada2 in R later anyway, as being limited to 18 threads (or perhaps 20) is a waste similar to what dada2 does to me by discarding my reads.

The thing is that 2 years ago I used qiime1 on a set of 96 samples and the results made complete sense. E.g., samples from the same location (similar to “biological replicates”) mapped together in the trees/heatmaps, PCoA, etc. Now I revisited the old dataset, and dada2 discarded ALL reads from one sample and left about 200 reads of the 65000 of another. Based upon the old results, I cannot really imagine that the now-discarded reads were all wrong/erroneous/chimeric/contradicting the biological truth. So I am trying to find a way to do some error correction before dada2, because I still dislike the concept of discarding everything that is suspicious. But this concept may be an inherent part of dada2’s design, so I am afraid that even in R I will have a hard time with this.

1 Like

I can’t be sure without seeing some data and getting some specifics on what you did, but it sounds like the issue here is probably related to trimming low-quality bases from the 3’ ends of the reads, causing the reads to fail merging, not because dada2 is dropping all reads due to the presence of errors (though it could be that too; I need to see the specifics). If you want our help troubleshooting that issue, please open a new topic and provide the usual info about what you ran, the quality score plots, and the dada2 stats output (as a QZV).

Paired-end reads in QIIME 1 would have been joined (e.g., with fastq-join) prior to quality filtering. To achieve a similar workflow in QIIME 2, perhaps more to your liking, I recommend using q2-vsearch to join your paired-end reads, then q2-deblur to denoise (or q2-vsearch to cluster OTUs). This is definitely an issue when comparing qiime1 vs. qiime2; the qiime1-style protocol is somewhat more permissive, and joining prior to QC will save more paired-end reads that just barely join, though doing so may introduce more errors (which deblur should sort out for you).
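A qiime1-like join-then-denoise workflow could look roughly like this (a sketch only: command and option names are as I recall them from the 2019-era plugins, the filenames and trim length are placeholders, so please verify everything against `qiime ... --help` for your version):

```shell
# Join paired-end reads first (q2-vsearch), quality-filter the joined
# reads, then denoise with deblur: approximating the QIIME 1 workflow
# of joining prior to QC. Filenames below are placeholders.
qiime vsearch join-pairs \
  --i-demultiplexed-seqs demux-paired.qza \
  --o-joined-sequences demux-joined.qza

qiime quality-filter q-score-joined \
  --i-demux demux-joined.qza \
  --o-filtered-sequences demux-joined-filtered.qza \
  --o-filter-stats filter-stats.qza

qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-joined-filtered.qza \
  --p-trim-length 250 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-stats deblur-stats.qza
```

The key difference from the dada2 route is that joining happens before any denoising, so pairs with imperfect overlaps survive to the deblur step rather than being dropped at merge time.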

dada2 does not really discard everything that is suspicious — just anything that is so suspicious that it seems unlikely that dada2 can repair it! Otherwise that is the goal of dada2 — to identify likely errors and attempt to correct those errors to recover the true signal that is present. As a rule of thumb, I think this is a good way to interpret the outputs of dada2:

| % reads discarded | Interpretation |
| --- | --- |
| 0-10% | Extraordinarily good luck! |
| 10-25% | Typical output from dada2 |
| 25-50% | Noisy data, but adjusting parameter settings could improve your yields |
| 50-75% | Low-quality data or possibly sub-optimal parameter settings; check your stats output! |
| 75-100% | Something is wrong; see the stats output to diagnose. Either the data are exceptionally poor quality, or truncation needs to be adjusted to (a) truncate more, to prevent abnormal filtering of low-quality reads, or (b) truncate less, to permit paired-end read joining. |
3 Likes

Hi @Peter_Kos, I have a question about this:

I am confused about this - why do you need to run dada2 in R? q2-dada2 supports multiple threads via the --p-n-threads parameter… As far as being “limited” - that is a function of data size (and shape), as well as your computational resources. DADA2 (and q2-dada2) doesn’t cap the number of threads you can utilize…

1 Like

Hi, this is sadly not so. I have had a thread here in the Forum related to this issue. My resources were not at all used up and dada2 still threw an error, for which eventually one of your colleagues said that (1) I should then run dada2 on 18 threads, and (2) perhaps some R geeks could fish out the cause.
I have tried various data sizes, numbers of threads, and other options, and the finding was that more than 18 threads gives the error message and makes dada2 die, even if I have plenty more threads, free memory, disk space, whatever.

Today, having forgotten the issue, I issued the next command and again got the same message, so this is more than reproducible:

qiime dada2 denoise-single --i-demultiplexed-seqs CLCmergSeq.qza --p-trunc-len 480 --p-n-threads 96 --o-table CLCmergSeq_table.qza --o-representative-sequences CLCmergSeq_RepSeq.qza --o-denoising-stats CLCmergSeq_Dada2DenoisingStats.qza

(qiime2-2019.10) [email protected]:~/Duna/VVV/CLCMerge$ cat /tmp/qiime2-q2cli-err-1g90tg3u.log
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_single.R /tmp/qiime2-archive-y21ppnfh/0fdcb51d-67d3-4868-bd04-4ee60522c738/data /tmp/tmp9mxwpqqs/output.tsv.biom /tmp/tmp9mxwpqqs/track.tsv /tmp/tmp9mxwpqqs 480 0 2.0 2 Inf consensus 1.0 96 1000000 NULL 16

R version 3.5.1 (2018-07-02)
Loading required package: Rcpp
DADA2: 1.10.0 / Rcpp: 1.0.2 / RcppParallel: 4.4.4

1) Filtering Error in names(answer) <- dots[[1L]] :
  'names' attribute [45] must be the same length as the vector [44]
Execution halted
Traceback (most recent call last):
  File "/home/kp/.conda/envs/qiime2-2019.10/lib/python3.6/site-packages/q2_dada2/_denoise.py", line 177, in _denoise_single
    run_commands([cmd])
  File "/home/kp/.conda/envs/qiime2-2019.10/lib/python3.6/site-packages/q2_dada2/_denoise.py", line 36, in run_commands
    subprocess.run(cmd, check=True)
  File "/home/kp/.conda/envs/qiime2-2019.10/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['run_dada_single.R', '/tmp/qiime2-archive-y21ppnfh/0fdcb51d-67d3-4868-bd04-4ee60522c738/data', '/tmp/tmp9mxwpqqs/output.tsv.biom', '/tmp/tmp9mxwpqqs/track.tsv', '/tmp/tmp9mxwpqqs', '480', '0', '2.0', '2', 'Inf', 'consensus', '1.0', '96', '1000000', 'NULL', '16']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kp/.conda/envs/qiime2-2019.10/lib/python3.6/site-packages/q2cli/commands.py", line 328, in call
    results = action(**arguments)
  File "</home/kp/.conda/envs/qiime2-2019.10/lib/python3.6/site-packages/decorator.py:decorator-gen-457>", line 2, in denoise_single
  File "/home/kp/.conda/envs/qiime2-2019.10/lib/python3.6/site-packages/qiime2/sdk/action.py", line 240, in bound_callable
    output_types, provenance)
  File "/home/kp/.conda/envs/qiime2-2019.10/lib/python3.6/site-packages/qiime2/sdk/action.py", line 383, in callable_executor
    output_views = self._callable(**view_args)
  File "/home/kp/.conda/envs/qiime2-2019.10/lib/python3.6/site-packages/q2_dada2/_denoise.py", line 212, in denoise_single
    band_size='16')
  File "/home/kp/.conda/envs/qiime2-2019.10/lib/python3.6/site-packages/q2_dada2/_denoise.py", line 188, in _denoise_single
    " and stderr to learn more." % e.returncode)
Exception: An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.

For reference, here is the topic you are referencing:

So this boils down to memory limitations that you would encounter one way or another with dada2, and this is not specific to running dada2 in QIIME 2.

1 Like

Even if it is related to whether dada2 runs inside qiime2 or not, this is still an enigmatic, unsolved problem of yet unknown roots. If a small part of the memory and 20% of the processors are used, the program does not crash. It should not crash at 22% of the processors either, but it indeed crashes.
I do not know if the bug is in the plugin, in dada2, in R, or in my Ubuntu, but this is still not how it should work. When I run FastTree, all threads run at 100% each, so the htop screen is all green like a meadow. The same with BLAST, etc. But dada2 cannot stretch over 18 threads, and it is not a memory limitation, as that would be obviously visible.
(I found that Java also has some fancy hidden secrets around garbage collection and tricky heap-size limits at 48 GB, so I suspect that some system guru or R geek may later find the clue; when I have time I’ll try to find such people at our University.)

Thanks for writing me back, Peter,

One of my colleagues runs dada2 directly in R and prefers that over the plugin. The Qiime plugin should provide identical performance, but I think it’s worth trying it both ways.


I appreciate that you are willing to spend the time learning the new platform and getting it working for you. That’s what we are all trying to do on the forums. :qiime2:

Colin