dada2 derep fastq error

Hello,
I encountered an error while running dada2 on a large number (1152) of 16S rRNA samples. It ran for about 3.5 days before it gave me the error, so I would like to troubleshoot what went wrong before I attempt to rerun it.

The sequencing data was obtained from JGI where each sample was returned as a single fastq file, which had already gone through some QC using BBDuk as follows:

"Sequence data for library was generated at the DOE Joint Genome Institute (JGI) using Illumina technology [1]. An Illumina library was constructed and sequenced 2x249 using the Illumina NovaSeq SP platform which generated 28,006 reads totaling 6,973,494 bp. BBDuk (version 38.90) [2] was used to remove contaminants [3], trim reads that contained adapter sequence and homopolymers of G's of size 5 or more at the ends of the reads and right quality trim reads where quality drops to 0. BBDuk was used to remove reads that contained 4 or more 'N' bases, had an average quality score across the read less than 3 or had a minimum length <= 51 bp or 33% of the full read length. Reads mapped with BBMap [2] to masked [5] human, cat, dog and mouse references at 93% identity were separated into a chaff file [4]. Reads aligned to common microbial contaminants [6] were separated into a chaff file [4]. The final filtered fastq contained 27,932 reads totaling 6,839,197 bp."

I imported the files into a QIIME2 artifact using:
qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path Stands_Metadata.tsv \
  --output-path jgi-merged-demux.qza \
  --input-format SingleEndFastqManifestPhred33V2

After looking at the results of jgi-merged-demux.qza, I decided to run dada2 for further clean up steps as follows:
qiime dada2 denoise-single \
  --i-demultiplexed-seqs jgi-merged-demux.qza \
  --p-trim-left 7 \
  --p-trunc-len 129 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

After running for approximately 3.5 days, I encountered the following error:
"Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_single.R /var/folders/jh/tpmtltq512vf0sx9n51fzxyw0000gn/T/qiime2-archive-3kytlkby/f34da75b-ed05-4076-80bf-60e09a2d9ce8/data /var/folders/jh/tpmtltq512vf0sx9n51fzxyw0000gn/T/tmptkpwp71b/output.tsv.biom /var/folders/jh/tpmtltq512vf0sx9n51fzxyw0000gn/T/tmptkpwp71b/track.tsv /var/folders/jh/tpmtltq512vf0sx9n51fzxyw0000gn/T/tmptkpwp71b 129 7 2.0 2 Inf independent consensus 1.0 1 1000000 NULL 16

R version 4.0.3 (2020-10-10)
Loading required package: Rcpp
DADA2: 1.18.0 / Rcpp: 1.0.6 / RcppParallel: 5.1.2

  1. Filtering ................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
  2. Learning Error Rates
    139425870 total bases in 1142835 reads from 2 samples will be used for learning the error rates.
  3. Denoise samples .......................................................................................................................................................................................................................................................................................................................................................................Error in derepFastq(filts[[j]]) : Not all provided files exist.
    Execution halted
    Traceback (most recent call last):
    File "/Users/mongomac/opt/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/q2_dada2/_denoise.py", line 182, in _denoise_single
    run_commands([cmd])
    File "/Users/mongomac/opt/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/q2_dada2/_denoise.py", line 36, in run_commands
    subprocess.run(cmd, check=True)
    File "/Users/mongomac/opt/miniconda3/envs/qiime2-2021.4/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['run_dada_single.R', '/var/folders/jh/tpmtltq512vf0sx9n51fzxyw0000gn/T/qiime2-archive-3kytlkby/f34da75b-ed05-4076-80bf-60e09a2d9ce8/data', '/var/folders/jh/tpmtltq512vf0sx9n51fzxyw0000gn/T/tmptkpwp71b/output.tsv.biom', '/var/folders/jh/tpmtltq512vf0sx9n51fzxyw0000gn/T/tmptkpwp71b/track.tsv', '/var/folders/jh/tpmtltq512vf0sx9n51fzxyw0000gn/T/tmptkpwp71b', '129', '7', '2.0', '2', 'Inf', 'independent', 'consensus', '1.0', '1', '1000000', 'NULL', '16']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/mongomac/opt/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/q2cli/commands.py", line 329, in call
results = action(**arguments)
File "", line 2, in denoise_single
File "/Users/mongomac/opt/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/qiime2/sdk/action.py", line 244, in bound_callable
outputs = self.callable_executor(scope, callable_args,
File "/Users/mongomac/opt/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/qiime2/sdk/action.py", line 390, in callable_executor
output_views = self._callable(**view_args)
File "/Users/mongomac/opt/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/q2_dada2/_denoise.py", line 205, in denoise_single
return _denoise_single(
File "/Users/mongomac/opt/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/q2_dada2/_denoise.py", line 191, in _denoise_single
raise Exception("An error was encountered while running DADA2"
Exception: An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more."

Please advise on what caused this error to occur and how to correct it.

I am running this on a Macbook Pro (mid 2014) running Mojave OS 10.14.6 with 2.2GHz Intel Core i7 processor and 16 GB 1600 MHz DDR3 memory.
QIIME2 version 2021.4 installed in a conda environment using command line interface (cli).

Thank you for any advice or help you can provide!

Mike Ricketts

Hello!
16 GB of RAM is probably not enough to process this many samples and reads.
You can try it on a more powerful machine or, better, split your samples by sequencing run, denoise each subset separately with the same (!) parameters, and then merge the feature tables and rep-seqs.
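
One possible way to prepare such subsets is to split the import manifest before running `qiime tools import` on each piece. The sketch below is only an illustration: the `split_manifest` helper, the batch size, and the output prefix are hypothetical, and it assumes a tab-separated manifest with a single header line. Splitting by sequencing run would need run information that the manifest shown above does not carry, so fixed-size batches are used here as a stand-in.

```shell
#!/usr/bin/env bash
# Split a QIIME 2 fastq manifest into fixed-size batches, repeating the header
# line in every batch file.
# Assumption: the manifest is tab-separated with exactly one header line.
split_manifest() {
    manifest="$1"      # e.g. Stands_Metadata.tsv
    batch_size="$2"    # samples per batch (hypothetical choice, e.g. 96)
    prefix="$3"        # output prefix, e.g. manifest_batch
    awk -v n="$batch_size" -v prefix="$prefix" '
        NR == 1 { header = $0; next }      # remember the header line
        /^[[:space:]]*$/ { next }          # skip blank lines
        {
            if (count % n == 0) {          # time to start a new batch file
                if (out != "") close(out)
                batch++
                out = sprintf("%s_%03d.tsv", prefix, batch)
                print header > out
            }
            print > out
            count++
        }
    ' "$manifest"
}
```

For example, `split_manifest Stands_Metadata.tsv 96 manifest_batch` would write `manifest_batch_001.tsv`, `manifest_batch_002.tsv`, and so on (12 files for 1152 samples), each of which could then be passed to `--input-path` in its own import.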

Edit.
Hi @Michael_Ricketts
You may also use the multithreading option (check the q2-dada2 descriptions) to speed up denoising, though it will increase RAM requirements.
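
As a rough illustration of the multithreading option, the sketch below prints the `denoise-single` command from the original post with `--p-n-threads` added. The `build_denoise_cmd` helper is hypothetical and only performs a dry run; the thread count is an assumption to adjust to the machine (in q2-dada2, `0` requests all available cores).

```shell
#!/usr/bin/env bash
# Print the denoise-single command from the original post with a thread count added.
# Assumption: --p-n-threads is the q2-dada2 threading option (0 = all available cores).
build_denoise_cmd() {
    threads="${1:-0}"   # default to 0 when no thread count is given
    echo "qiime dada2 denoise-single" \
        "--i-demultiplexed-seqs jgi-merged-demux.qza" \
        "--p-trim-left 7" \
        "--p-trunc-len 129" \
        "--p-n-threads $threads" \
        "--o-representative-sequences rep-seqs-dada2.qza" \
        "--o-table table-dada2.qza" \
        "--o-denoising-stats stats-dada2.qza"
}

# Dry run: show the command; paste it into the shell to execute it.
build_denoise_cmd 16
```

Printing the command rather than running it keeps the sketch safe to try on a machine where a long denoising job is already in progress.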

There are additional concerns regarding your dataset that I missed (thanks @Mehrbod_Estaki):

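  1. In your post, it is stated that the library "generated 28,006 reads totaling 6,973,494 bp."

But in the error log, we can see "139425870 total bases in 1142835 reads from 2 samples will be used for learning the error rates."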

So either the center gave you the wrong info or you're using the wrong files (considering that your run took 3.5 days, most probably the sequencing center provided you with the wrong info).

  2. There are some issues with the NovaSeq platform and DADA2 that you should be aware of. I have processed NovaSeq datasets both in q2-dada2 and in the R version, modified to account for the quality-score issues, and I didn't notice drastic differences with my data, but you should probably check this as well.

Hi Timur!
Thanks for your response! First, the text that I posted from the sequencing center that you reference above is from 1 out of 1152 similar files, so the numbers for reads and bp only apply to the one sample associated with this example, NOT the entire run. I was just using it as an example to show what the sequencing center had done.

Second, I will definitely look into the issues with using DADA2 with NovaSeq data, IF I can get to that stage. Thus my current question/problem:

I have since gained access to a more powerful computer (196 GB memory, 64 CPU) and am attempting to rerun the same script as above. While it hasn't crashed (yet), it has been running for 13 days now. Is this normal? How can I check to make sure it is working without killing it? Should I just kill it and restart it, or is that a big mistake?

I saw above that you recommended using the multithread option to speed it up (this machine should be able to handle it), so maybe I'll try running that in parallel.

Any other advice is greatly appreciated.

Mike

Hi!

In my opinion that is too long, but I can't say for sure without working with the data.
I processed a dataset with 1600 samples on a server by splitting all samples according to sequencing run and lane, and it took about 1 hour because the tasks ran in parallel, so I guess on a local machine it would take several days.
As I already wrote, it is not recommended to denoise different runs/lanes together, since it can affect the error-learning step in DADA2 and lead to biased data.
So, if you have different sequencing runs/lanes, my suggestion is to abort it, split the dataset by run/lane, denoise each subset in multithreaded mode with the same parameters (important), and merge the output feature tables and rep-seqs.
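
The final merge step could look roughly like the sketch below, which prints one `qiime feature-table merge` and one `qiime feature-table merge-seqs` command for a list of batches. The `build_merge_cmds` helper and the file names (`table-<batch>.qza`, `rep-seqs-<batch>.qza`, `table-merged.qza`, `rep-seqs-merged.qza`) are hypothetical and should be adapted to whatever the per-batch runs actually produced.

```shell
#!/usr/bin/env bash
# Print the two merge commands for a set of per-batch DADA2 outputs.
# Assumption: batch outputs are named table-<batch>.qza and rep-seqs-<batch>.qza
# (hypothetical names used only for this illustration).
build_merge_cmds() {
    tables=""
    seqs=""
    for batch in "$@"; do
        tables="$tables --i-tables table-$batch.qza"    # one --i-tables per batch
        seqs="$seqs --i-data rep-seqs-$batch.qza"       # one --i-data per batch
    done
    echo "qiime feature-table merge$tables --o-merged-table table-merged.qza"
    echo "qiime feature-table merge-seqs$seqs --o-merged-data rep-seqs-merged.qza"
}

# Dry run for three batches: prints the commands instead of executing them.
build_merge_cmds 001 002 003
```

Because every batch is denoised with the same trimming and truncation parameters, identical sequences from different batches should end up as the same feature in the merged table.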


All these samples were on the same run, same lane I believe. But I will still try splitting them up. Thanks again! I'll let you know how it goes.

My run was eventually successful. The original run, which used 1 thread (196 GB memory, 64 CPU), finished after 14 days. I also started another run in parallel on a different machine with the same specs, and it finished in 4 days.
Thanks for your help!

mike
