I’m developing a package that I’m hoping to eventually turn into a QIIME 2 plugin. (So feel free to kick this over to other tools if it’s not appropriate here.) It’s relatively memory intensive, so there are steps that are broken into smaller, parallelizable loops.
The package is currently in pure Python, and something like Cython is a bit intimidating for this scale/type, so I’d prefer to stay away from it. I’ve used the parallel module before with varying degrees of success, so I’m wondering if there’s a better way to do parallel processing?
I think there are a few options, depending on how involved the parallelism needs to be:
The multiprocessing module is probably the first thing to reach for, as it handles your basic “do the same thing, but X times on these X things”. It gets a little more annoying when you need to communicate between processes, but that can be done via Queue and Pipe well enough. I would avoid concurrent.futures (also in stdlib); we use it for asynchronous execution of q2 actions (via .asynchronous()), and honestly it’s awful.
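To make the basic multiprocessing pattern concrete, here is a minimal sketch of the “same thing, X times on X things” case. The names make_chunks, chunk_sum and parallel_chunk_sums are hypothetical, and chunk_sum is only a stand-in for whatever the memory-intensive step really is.

```python
from multiprocessing import Pool


def make_chunks(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def chunk_sum(chunk):
    """Hypothetical stand-in for the expensive per-chunk computation."""
    return sum(x * x for x in chunk)


def parallel_chunk_sums(items, size, processes=2):
    """Run chunk_sum on every chunk using a pool of worker processes."""
    with Pool(processes=processes) as pool:
        # map() blocks until every chunk is done and keeps the input order.
        return pool.map(chunk_sum, make_chunks(items, size))


if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing
    # this module; without it each worker would try to start its own pool.
    print(parallel_chunk_sums(list(range(10)), size=4))
```

Working on chunks rather than single items keeps the per-task overhead down, and the chunk size is the knob for trading memory per worker against the number of tasks.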
An alternative to that would be joblib. It has a slightly more barebones API, which can be nice, but it can’t handle communication at all (beyond returning a result, of course), so it’s a “set it and forget it” kind of parallelism. The upside is that it has multiple “backends”, which is useful in principle, but I’m not sure I’ve ever really used them. (If/when we add explicit support for parallelism in Q2, you can expect we will have a custom backend to interface with joblib, as sklearn uses it extensively, and that’s obviously pretty important.)
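For comparison, the same “set it and forget it” pattern in joblib might look roughly like this. It is only a sketch: squared_total and parallel_totals are hypothetical names, and it assumes joblib is installed.

```python
from joblib import Parallel, delayed


def squared_total(chunk):
    """Hypothetical stand-in for the expensive per-chunk computation."""
    return sum(x * x for x in chunk)


def parallel_totals(chunks, n_jobs=2):
    """Run squared_total on each chunk; results come back in input order."""
    # delayed() only records the call; Parallel dispatches the recorded
    # calls to the workers and collects the return values into a list.
    return Parallel(n_jobs=n_jobs)(delayed(squared_total)(c) for c in chunks)


if __name__ == "__main__":
    print(parallel_totals([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]))
```

Parallel also accepts a backend argument (for example backend="threading"), which is where the multiple backends mentioned above come in.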
And then for full control of anything, you could take a look at dask, which is probably one of the most comprehensive systems I have seen. Most likely q2 parallelism will be written using dask.distributed, as it already interfaces with joblib (and evidently many more libraries than when I last looked).
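As a rough idea of what the dask version could look like, here is a small sketch using dask.delayed. It assumes dask is installed; chunk_total and dask_totals are hypothetical names, and nothing here reflects how q2 itself will use dask.

```python
import dask
from dask import delayed


def chunk_total(chunk):
    """Hypothetical stand-in for the expensive per-chunk computation."""
    return sum(x * x for x in chunk)


def dask_totals(chunks, scheduler="threads"):
    """Build one lazy task per chunk, then compute them all together."""
    tasks = [delayed(chunk_total)(c) for c in chunks]
    # Nothing has run yet; compute() executes the task graph and returns
    # a tuple of results in the same order as the tasks.
    return list(dask.compute(*tasks, scheduler=scheduler))


if __name__ == "__main__":
    print(dask_totals([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]))
```

The threaded scheduler is used here only to keep the sketch simple. For CPU-bound pure Python work, scheduler="processes" or a dask.distributed cluster would be the more likely choice.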
Thank you! I will look into dask and joblib!