I was working through the steps of the Atacama Tutorial and ran into an error at the DADA2 step. I tried running that step with the files I generated in the previous step, and then again using the file available in the tutorial. Both runs failed in seemingly identical ways (DADA2 return code -9, with the log seeming to indicate that “The filter removed all reads”).
This is the command from the command line (note that demux-2.qza is the downloaded file – I didn’t want to overwrite my own file):
qiime dada2 denoise-paired
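For reference, the full invocation looked roughly like this, using the trimming/truncation parameters suggested by the tutorial (output file names here are just my choices):

```shell
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-2.qza \
  --p-trim-left-f 13 \
  --p-trim-left-r 13 \
  --p-trunc-len-f 150 \
  --p-trunc-len-r 150 \
  --o-table table-2.qza \
  --o-representative-sequences rep-seqs-2.qza \
  --o-denoising-stats denoising-stats-2.qza
```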
This is the error:
Plugin error from dada2:
An error was encountered while running DADA2 in R (return code -9), please inspect stdout and stderr to learn more.
Debug info has been saved to /tmp/qiime2-q2cli-err-x51nddpd.log
These are the contents of the error log:
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.
R version 3.5.1 (2018-07-02)
Loading required package: Rcpp
DADA2: 1.10.0 / Rcpp: 1.0.3 / RcppParallel: 4.4.4
Filtering The filter removed all reads: /tmp/tmpfxkmf8do/filt_f/BAQ1370.1.3_57_L001_R1_001.fastq.gz and /tmp/tmpfxkmf8do/filt_r/BAQ1370.1.3_57_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmpfxkmf8do/filt_f/BAQ1370.3_71_L001_R1_001.fastq.gz and /tmp/tmpfxkmf8do/filt_r/BAQ1370.3_71_L001_R2_001.fastq.gz not written.
Thanks for the comprehensive post, @AaronW. How much RAM do you have allocated for your Virtualbox instance?
I’m usually inclined to believe error messages, and this one is quite clear about what it thinks the problem is. However, I’ve been unable to reproduce the error locally with either the 1% or 10% atacama data sets, and we see a lot of return code -9 messages when virtualbox images aren’t given enough memory.
I did try it with the 1% data set to reduce the memory requirement, and it continued to give the same error.
I also just realized that there's a second error in the error log. It's an error in the error-rate learning step, which I don't think explains the filtering problem. But I've included the whole log here in case it contains some extra information that I failed to copy over the first time. (I added .txt to the end of the file name so that it would upload.)
Thanks for sharing that complete log, @AaronW. That definitely confirms some suspicions. Some samples have no reads passing the filtering step, but as you can see, that isn’t the cause of the error: DADA2 happily tells you which samples were empty and moves on to the next step, where everything promptly blows up.
Can you try allocating more RAM to your VM and re-running? I’m a little surprised the 1% data set is blowing up with 2 GB, but that’s where I’d put my money for solving this problem.
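If it helps, you can check and raise the VM’s memory allocation from the host with VBoxManage (a sketch; the VM name below is hypothetical – use whatever `list vms` reports, and power the VM off before running `modifyvm`):

```shell
# Find the exact name of your registered VM
VBoxManage list vms

# Show the VM's current memory allocation
VBoxManage showvminfo "QIIME 2 Core" | grep -i memory

# Raise the allocation to 4 GB (VM must be powered off first)
VBoxManage modifyvm "QIIME 2 Core" --memory 4096
```

The same setting is available in the VirtualBox GUI under Settings → System → Motherboard.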
I increased the RAM and it seemed to fix the problem for both. Thanks!
Has anyone done any timed benchmarking with these tutorial scripts? It took about 27 minutes to complete the DADA2 step on the 10% data set. I’m admittedly running it on my old computer (Intel i5-650, 3.2 GHz dual-core, 4 GB RAM), so I expected poor performance. But this has me worried that even if I move things over to my current computer (Intel i5-8500, 4 GHz six-core, 8 GB RAM), I’m still going to be outmatched if I try to go to a full-sized data set.
Right now, I’m just “tinkering” with QIIME2 to see whether this is something I think I can reasonably pursue with the resources I have, or whether I will need to look for more computing power before being able to give this any real consideration. I did a forum search, but didn’t find any information of this type.
Thanks for the help and for any extra insight that you can provide.
Glad to hear that solved your problem, @AaronW! I’m not sure whether anyone has done any benchmarking, but I can give you some anecdotal information.
First, you’ll probably have significantly better luck on your newer machine, as it will improve your performance with both CPU-bound and memory-bound commands. Depending on the size of your real data, you can probably get through a complete analysis with that setup.
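For example, DADA2 itself can take advantage of all six of those cores if you ask it to – a sketch, using the Atacama tutorial’s parameters (`--p-n-threads 0` tells q2-dada2 to use all available cores; the default is a single thread):

```shell
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 13 \
  --p-trim-left-r 13 \
  --p-trunc-len-f 150 \
  --p-trunc-len-r 150 \
  --p-n-threads 0 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```

Note that more threads means more memory in use at once, so on a RAM-constrained VM you may want to keep this modest.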
Second, if you continue running into out-of-memory issues, many commands have parameters that let you trade longer runtimes for a lower memory footprint by “chunking” your data, reducing the number of permutations, or otherwise decreasing the memory required. These are exposed in the docs for any given plugin, and a quick forum search for your command and error will often yield a path forward.
Third, people with bigger data or older computers sometimes rent servers for a short time to run an analysis. This may sound crazy, but it can be a super-cheap way to get some work done without tying up your workstation for a day (cheap as in a couple of bucks an hour for massive resources). Again, a quick forum search for “rent a computer” or “rent a server” will turn up some links, and the docs offer install guides for native Linux and for AWS instances.