qiime dada2 denoise-paired on HPC computer

mefistofele82 · June 3, 2019, 12:15pm

Hi,

I'm running QIIME2 on the HPC cluster at my university, and unfortunately I'm having problems with the command in the subject line.
The HPC uses SLURM as its job scheduler, and you have to specify a time limit for each job you submit. For my account, the maximum time limit is 12 hours.

First, I import the data as a QIIME2 artifact, with the output demux-paired-end.qza,
and then to denoise the data (trimming and quality filtering) I use dada2 with the following command line:

qiime dada2 denoise-paired \
--i-demultiplexed-seqs demux-paired-end.qza \
--o-table table \
--o-representative-sequences rep-seqs \
--o-denoising-stats dada2stats \
--p-trim-left-f 10 \
--p-trim-left-r 10 \
--p-trunc-len-f 250 \
--p-trunc-len-r 250

With a dataset of 25 samples, the job doesn't finish within 12 hours and is always cancelled, no matter how much memory and how many CPUs I request.

My questions are:

Do you know if this QIIME2 command supports --ntasks in SLURM, so that it runs on multiple cores? Or does it support parallelisation across multiple nodes?
Is it possible to chunk this command, to make it smaller and faster? For example, could I run the command for each sample separately instead of running it on one big file?

Thanks for your time

Nicholas_Bokulich · June 3, 2019, 1:18pm

dada2 will often take > 12 hr for a single job to run. You may want to discuss with your admin to see if you can increase the maximum time limit at least temporarily...

See the --p-n-threads parameter for this method

In theory yes you could break up the samples but this is probably not a good idea since it will impact the error model and alter denoising/chimera checking.

Look at multithreading to make it faster, and see if your system admin will give you a longer time limit...

mefistofele82 · June 3, 2019, 1:37pm

Hi!

thanks for your reply.
I can increase the time up to 24h max. Do you think that would be enough?

I will have a look at --p-n-threads

thanks

Nicholas_Bokulich · June 3, 2019, 1:47pm

Yes, with multithreading 24h should be enough (exact runtimes are difficult to predict for dada2).

mefistofele82 · June 3, 2019, 2:02pm

Hi,

Does the --p-n-threads option correspond to the number of tasks?
If so, I should specify the same number in --ntasks for the SLURM job submission.

So if --p-n-threads 24
sbatch --ntasks 24 etc.....

is that right?

thanks
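
One way to keep the thread count and the SLURM request in sync is sketched below. It assumes that DADA2's threads all run inside a single process on one node, so the cores are requested with --cpus-per-task (one task) rather than with --ntasks; check with your admin whether that matches your cluster's setup. The job name, the 64gb memory request, the 24 cores and the helper names dada2_threads and run_dada2 are illustrative assumptions, not values from the QIIME 2 documentation. SLURM_CPUS_PER_TASK is set by SLURM inside a job when --cpus-per-task is given.

```shell
#!/bin/bash
#SBATCH --job-name=dada2_denoise   # hypothetical job name
#SBATCH --nodes=1                  # assumption: threads cannot span nodes
#SBATCH --ntasks=1                 # one process ...
#SBATCH --cpus-per-task=24         # ... with 24 cores for its threads
#SBATCH --mem=64gb                 # assumed memory request, adjust as needed
#SBATCH --time=24:00:00            # time limit hrs:min:sec

# Thread count for DADA2: taken from the SLURM allocation when present,
# falling back to 1 when the script runs outside a job.
dada2_threads() {
  echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Same command as in the first post, plus --p-n-threads.
run_dada2() {
  qiime dada2 denoise-paired \
    --i-demultiplexed-seqs demux-paired-end.qza \
    --o-table table \
    --o-representative-sequences rep-seqs \
    --o-denoising-stats dada2stats \
    --p-trim-left-f 10 \
    --p-trim-left-r 10 \
    --p-trunc-len-f 250 \
    --p-trunc-len-r 250 \
    --p-n-threads "$(dada2_threads)"
}

# Launch only when the QIIME 2 environment is active.
if command -v qiime >/dev/null 2>&1; then
  run_dada2
else
  echo "qiime not found on PATH; activate the QIIME 2 environment first" >&2
fi
```

With this layout the number only has to be changed in one place (the --cpus-per-task line), and the qiime command picks it up automatically.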

ben · June 3, 2019, 3:34pm

Our uni uses SLURM as well. Here are the commands I use to run my pipeline:

#!/bin/bash
#SBATCH --job-name=My_QIIME_title # Job name (placeholder, no spaces)
#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=MY_EMAIL_GOES_HERE@email.email # Where to send mail (placeholder)
#SBATCH --ntasks=16 # Request 16 tasks (by default one CPU core each), not 16 nodes
#SBATCH --mem=64gb # Job memory request
#SBATCH --time=72:05:00 # Time limit hrs:min:sec
#SBATCH --output=serial_test_%j.log # Standard output and error log

Here is my DADA2 code:

qiime dada2 denoise-paired --p-n-threads 0 --i-demultiplexed-seqs ~/QIIME2_3_demux/demux.qza --o-table ~/QIIME2_4_DADA2/table.qza --o-representative-sequences ~/QIIME2_4_DADA2/rep-seqs.qza --o-denoising-stats ~/QIIME2_4_DADA2/denoising-stats.qza --p-trim-left-f 13 --p-trim-left-r 13 --p-trunc-len-f 146 --p-trunc-len-r 146

Even though I set the time limit to 72 hours, every run so far has finished overnight (<12 hours).

You could add threads; the option for that (--p-n-threads) is in the command above. I ask SLURM to allocate 64 GB of RAM and 16 tasks (--ntasks=16).
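
Since --p-n-threads 0 asks the plugin to use all available cores, it can help to log what the job actually sees before the long run starts. Below is a small sketch of such a check; the function name report_allocation is made up for illustration, and the SLURM_* variables are only set inside a job (some of them only when the matching option was requested). On clusters where SLURM confines jobs to their allocated cores, nproc usually reflects the allocation rather than the whole node, but that depends on how the cluster is configured.

```shell
#!/bin/bash

# Print the resources SLURM reports for this job, plus the number of
# cores the current process is allowed to use.
report_allocation() {
  echo "tasks requested:  ${SLURM_NTASKS:-not set}"
  echo "cpus per task:    ${SLURM_CPUS_PER_TASK:-not set}"
  echo "nodes in job:     ${SLURM_JOB_NUM_NODES:-not set}"
  # nproc is part of GNU coreutils; getconf is a fallback for other systems.
  echo "cores visible:    $(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)"
}

report_allocation
```

Putting the report_allocation call at the top of the batch script, before the qiime command, writes these four lines to the job log, so a request spread over several nodes or a core count lower than expected shows up right away.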