I am running QIIME 2 on our institution's HPC and I'm wondering what the most efficient configuration is. What is the optimal HPC environment configuration: number of nodes and threads, memory per thread, and whether the work is I/O, memory, or CPU intensive?
Specifically, I am interested in the multithreaded commands, such as:
qiime vsearch cluster-features-open-reference
qiime feature-classifier classify-consensus-vsearch
I am also wondering whether checkpointing is something that Q2 is planning to implement for these more time-intensive commands.
I would also like to hear what the QIIME devs recommend. I know that estimating optimal settings can be hard because it depends on both the size and the complexity of the data sets.
I also make use of HPC environments at work, and for vsearch database search scripts, like the two you mention, I usually use:
vsearch --threads [number of threads on that node]
My reference database fits easily into the 64 GB of RAM on each machine, and our I/O can keep up with the 24 threads of search (which is mostly CPU heavy).
Amplicon data sets are smaller and simpler than most of the projects folks run here. Because of their size, I can easily process a full run on only a single node, in under an hour. At this scale, I don’t need to optimize.
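As a rough sketch of the "one thread per core on the node" approach above, the thread count can be picked up at run time rather than hard-coded. This assumes a SLURM scheduler (the thread never says which scheduler is in use) and GNU `nproc`; `pick_threads` is a hypothetical helper name. I believe the QIIME 2 wrappers for the two commands expose the same setting as `--p-threads`, but check `--help` on your installed version.

```shell
#!/usr/bin/env bash
# pick_threads (hypothetical helper): prefer the scheduler's CPU allocation,
# otherwise fall back to every core visible on the node.
# SLURM_CPUS_PER_TASK is only set when the job requests --cpus-per-task
# (SLURM is an assumption here, not something stated in this thread).
pick_threads() {
    if [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
        echo "$SLURM_CPUS_PER_TASK"
    else
        nproc
    fi
}

THREADS="$(pick_threads)"
# Print the start of the search command with the chosen thread count filled in.
echo "vsearch --threads ${THREADS}"
```

Preferring the scheduler's value avoids oversubscribing a shared node when the job was given fewer cores than the machine has.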
Unfortunately, I don't think any of us really has an answer on recommended settings.
Checkpointing is actually one of the long-term goals for pipelines (actions that are made of other actions). Since there's an artifact in between steps, we could conceivably just "replay" the provenance to see how far into the pipeline we are if it dies.
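Until something like that exists, the same idea can be approximated by hand in a job script: skip any step whose output artifact is already on disk. This is only a user-side sketch, not a QIIME 2 feature, and `run_step` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# run_step OUTPUT CMD [ARGS...]
# Run CMD only if OUTPUT does not exist yet, so a resubmitted job
# resumes after the last step that produced its output file.
run_step() {
    local output="$1"
    shift
    if [ -e "$output" ]; then
        echo "skip: $output already exists"
        return 0
    fi
    "$@"
}
```

Each long-running command in the script would then be wrapped as `run_step <output artifact> <command and its arguments>`. Note that this only checks for the file's existence: if a job is killed while an output is being written, delete that partial file before resubmitting.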