Running qiime2 on HPC most efficiently

Analissa_Sarno · February 14, 2018, 3:48pm

Hi Q2 team!

I am running Qiime2 on our institution's HPC and I'm wondering what the most efficient configuration is. What is the optimal HPC environment configuration: number of nodes and threads, memory per thread, and so on? Are these commands mostly IO, memory, or CPU intensive?

Specifically, I am interested in the commands that support multithreading, such as:
qiime vsearch cluster-features-open-reference
qiime feature-classifier classify-consensus-vsearch

I am also wondering whether checkpointing is something that Q2 is planning to implement for these more time-intensive commands?

Thanks so much for your input!

colinbrislawn · February 14, 2018, 8:02pm

Hello Analissa,

I would also like to hear what the qiime devs recommend. I know that estimating optimal settings can be hard because it depends both on the size and complexity of the data sets.

I also use HPC environments at work and have found that for vsearch database search scripts, like the two you mention, I usually use:

1 node
vsearch --threads [number of threads on that node]

My reference database fits easily into the 64 GB of RAM on each machine, and our IO can keep up with the 24 threads of search (which is mostly CPU heavy).
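
As a rough illustration of the single-node setup described above, here is a minimal sketch of a batch script. It assumes a Slurm scheduler (the thread does not say which scheduler is in use), and the `#SBATCH` values simply mirror the 24 threads and 64 GB mentioned above. The helper `pick_threads` and the `.qza` file names are hypothetical placeholders. The `--p-threads` option is, to my knowledge, how the QIIME 2 vsearch-based commands pass a thread count through to vsearch, but check `--help` for your installed version.

```shell
#!/bin/bash
#SBATCH --nodes=1             # one node, as in the setup described above
#SBATCH --ntasks=1            # a single multithreaded process
#SBATCH --cpus-per-task=24    # assumption: 24 threads available on the node
#SBATCH --mem=64G             # assumption: 64 GB of RAM on the node

# Hypothetical helper: use the CPU count Slurm granted, else count local CPUs.
pick_threads() {
    if [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
        echo "$SLURM_CPUS_PER_TASK"
    else
        nproc 2>/dev/null || getconf _NPROCESSORS_ONLN
    fi
}

THREADS="$(pick_threads)"
echo "Using $THREADS threads"

# Placeholder file names; only runs when qiime and the query file are present.
if command -v qiime >/dev/null 2>&1 && [ -f rep-seqs.qza ]; then
    qiime feature-classifier classify-consensus-vsearch \
        --i-query rep-seqs.qza \
        --i-reference-reads ref-seqs.qza \
        --i-reference-taxonomy ref-taxonomy.qza \
        --p-threads "$THREADS" \
        --o-classification taxonomy.qza
else
    echo "qiime or rep-seqs.qza not found; skipping the classification step"
fi
```

Reading the thread count from the scheduler rather than hard-coding it keeps the job from oversubscribing a node if the allocation changes.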

Amplicon data sets are smaller and simpler than most of the projects folks run here. Because of their size, I can easily process a full run on only a single node, in under an hour. At this scale, I don't need to optimize.

What kinds of bottlenecks are you running into?

Colin

ebolyen · February 16, 2018, 11:52pm

Hi @Analissa_Sarno,

Unfortunately I don't think any of us really have the answer to that.

As for checkpointing, that's actually one of the long-term goals for pipelines (actions that are made of other actions). Since there's an artifact in between steps, we could conceivably just "replay" the provenance to see how far into the pipeline we are if it dies.
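
Until something like that exists, the same idea can be approximated by hand at the job-script level: because each step writes an artifact, a wrapper can skip any step whose output file is already on disk. The sketch below is not a QIIME 2 feature and does not touch provenance. `run_step` is a hypothetical helper, the file names in the commented usage are placeholders, and it assumes that an existing output file means the step finished.

```shell
# run_step OUTPUT CMD...
# Run CMD only if OUTPUT does not exist yet, so a resubmitted job resumes
# at the first step whose artifact is missing.
run_step() {
    _out="$1"
    shift
    if [ -e "$_out" ]; then
        echo "skip: $_out already exists"
        return 0
    fi
    echo "run: $*"
    # If the command fails, remove any partial output so the step is retried.
    "$@" || { rm -f "$_out"; return 1; }
}

# Example usage with placeholder file names (uncomment inside a job script):
# run_step taxonomy.qza \
#     qiime feature-classifier classify-consensus-vsearch \
#         --i-query rep-seqs.qza \
#         --i-reference-reads ref-seqs.qza \
#         --i-reference-taxonomy ref-taxonomy.qza \
#         --o-classification taxonomy.qza
```

This only checkpoints between commands; a single long-running command that dies partway through still restarts from the beginning.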
