Dada2 denoise-paired Return Code -9 with Memory and Time to spare??

shastamcmillen · May 20, 2019, 7:35pm

I am getting a Return code -9 error while running a Dada2 denoise-paired batch script.

I have run the same script successfully on a separate but similar demultiplexed sequence file. After reading all the posts about error code -9, I increased the max RAM incrementally from 20 GB (the other sequence file ran successfully with this in ~5 hours) up to 64 GB, and bumped the time limit from 24h to 48h. The job still fails with the same -9 error after only about 5 hours. It seems unlikely that two similar files should need such drastically different amounts of RAM. I used the top command to monitor the job, and it never hit max RAM, so now I'm thinking it might be something else. Below is the exact script that was submitted to a Slurm batch queue:

#! /bin/bash -l
#SBATCH -D /home/samcmill/myprojects/fcstudy
#SBATCH -o /home/samcmill/myprojects/fcstudy/slurm-log/fcstudydenoise-stdout-%j.txt
#SBATCH -e /home/samcmill/myprojects/fcstudy/slurm-log/fcstudydenoise-stderr-%j.txt
#SBATCH -J fcstudydenoise
#SBATCH -t 48:00:00
#SBATCH --mail-user=samcmillen@ucdavis.edu
#SBATCH --mail-type=ALL
#SBATCH -p high
#SBATCH --mem=64G
#SBATCH -n 1
#SBATCH -N 1

hostname

module load bio
source activate qiime2-2018.11
srun qiime 

cd ~/myprojects/fcstudy

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demultiplexed-seqs_v42.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 240 \
  --o-table table_v42.qza \
  --o-representative-sequences rep-seqs_v42.qza \
  --o-denoising-stats denoising-stats_v42.qza \
  --verbose

qiime feature-table summarize \
  --i-table table_v42.qza \
  --o-visualization table_v42.qzv \
  --m-sample-metadata-file metadata_v42.tsv

qiime feature-table tabulate-seqs \
  --i-data rep-seqs_v42.qza \
  --o-visualization rep-seqs_v42.qzv

colinbrislawn · May 21, 2019, 5:49pm

Hello Shasta,

Welcome to Qiime 2!

The -9 return code indicates that the process was ‘killed’ by the Linux system with signal 9 (SIGKILL). I think you are doing everything right inside the sbatch submission file, so I’m not sure why the system would choose to kill the job at around 5 hours.

Here are two ideas:

Look for the Slurm standard out and standard error files. Maybe these will have info about what killed the job.
Update to the newest version of Qiime, and see if the new version runs better.

Troubleshooting HPC jobs can be tricky, so let me know what you find.

Colin

shastamcmillen · May 21, 2019, 8:43pm

Thank you Colin,

I am working with tech support to see about updating.

I have attached the standard output and standard error files.

Shasta

fcstudydenoise-stdout-10822149.txt (2.3 KB)
fcstudydenoise-stderr-10822149.txt (836 Bytes)

colinbrislawn · May 22, 2019, 4:04pm

Thanks for this!

Did you take a look inside these files? I think I found a clue in the error file:

Error: Invalid value for "--i-data": File "rep-seqs_v42.qza" does not exist.

slurmstepd: error: Detected 1 oom-kill event(s) in step 10822149.batch cgroup.
Some of your processes may have been killed by the cgroup out-of-memory handler.

Does the rep-seqs_v42.qza file exist? Did dada2 finish?
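
For anyone else chasing a -9, here is a rough sketch of how the stderr logs could be scanned for this kind of message. The helper name find_kill_reasons and the search patterns are my own assumptions, based on the slurmstepd lines quoted above plus the usual time-limit notice; your cluster may word things differently. The demo writes a throwaway log so the snippet runs without Slurm.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm or QIIME 2): print every line in the
# given log files that looks like a scheduler kill message.
# The patterns are assumptions; adjust them to your cluster's wording.
find_kill_reasons() {
  # -H: show file name, -n: show line number, -i: ignore case, -E: extended regex
  grep -HniE 'oom-kill|out-of-memory|DUE TO TIME LIMIT' "$@"
}

# Demo on a temporary file containing the lines quoted in this thread.
demo_log=$(mktemp)
printf '%s\n' \
  'Error: Invalid value for "--i-data": File "rep-seqs_v42.qza" does not exist.' \
  'slurmstepd: error: Detected 1 oom-kill event(s) in step 10822149.batch cgroup.' \
  'Some of your processes may have been killed by the cgroup out-of-memory handler.' \
  > "$demo_log"

find_kill_reasons "$demo_log"
rm -f "$demo_log"
```

On a real job you would point it at the files named by the #SBATCH -e line, for example find_kill_reasons slurm-log/*stderr*.txt.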

Colin

shastamcmillen · May 22, 2019, 6:07pm

Dada2 never finishes, so the output files are never created.
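
That also explains the "does not exist" error: the batch script kept going after dada2 was killed, so the later steps ran against files that were never written. A minimal sketch of one way to stop at the first failure is below. The run_steps helper is hypothetical and the echo/false commands are stand-ins for the real qiime calls; putting set -e near the top of the batch script is the usual simpler alternative.

```shell
#!/bin/bash
# Hypothetical helper (not part of QIIME 2 or Slurm): run each command string
# in order and stop as soon as one exits with a non-zero status.
run_steps() {
  local cmd status
  for cmd in "$@"; do
    bash -c "$cmd"
    status=$?
    if [ "$status" -ne 0 ]; then
      echo "stopped: '$cmd' exited with status $status" >&2
      return "$status"
    fi
  done
}

# Demo with stand-in commands: the second step fails, so the third never runs.
run_steps "echo denoise" "false" "echo summarize" || echo "pipeline stopped early"
```

In the real script each stand-in would be one of the qiime commands, so feature-table summarize and tabulate-seqs would be skipped when denoise-paired is killed.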

I used the updated version and it appears to be making a difference; the job has been running for 20 hr now. Hopefully it works! Will update once it finishes.

shastamcmillen · May 23, 2019, 2:59am

Update: it took about 30 hours, but it finished successfully!

I thought this might be valuable to share for anyone estimating dada2 paired run times from the number of samples. The first run took around 5-6 hours with 8 million sequences and 95 samples. The second run took around 30 hours with comparable quality and sequence count, but with only 55 samples.

Still no idea what about the 2018.11 version would cause the job to be killed, but really glad it was a simple fix.

Thanks again for your help, glad this is finally resolved!