I’m using q2-picrust2’s custom-tree-pipeline workflow, as described in the q2-picrust2 tutorial, to process a large ASV feature table (21,884 samples x 1,019,523 ASVs). I’m getting the following error, which seems to be a limitation of Python’s multiprocessing module when the data being serialized is too large (see the linked references):
```
Traceback (most recent call last):
  File "/opt/conda/envs/qiime2-2019.1/bin/hsp.py", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/root/lib/picrust2-2.0.3-b/scripts/hsp.py", line 149, in <module>
  File "/root/lib/picrust2-2.0.3-b/scripts/hsp.py", line 134, in main
  File "/root/lib/picrust2-2.0.3-b/picrust2/wrap_hsp.py", line 65, in castor_hsp_workflow
    for trait_in in file_subsets)
  File "/opt/conda/envs/qiime2-2019.1/lib/python3.6/site-packages/joblib/parallel.py", line 934, in __call__
  File "/opt/conda/envs/qiime2-2019.1/lib/python3.6/site-packages/joblib/parallel.py", line 833, in retrieve
  File "/opt/conda/envs/qiime2-2019.1/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 521, in wrap_future_result
  File "/opt/conda/envs/qiime2-2019.1/lib/python3.6/concurrent/futures/_base.py", line 432, in result
  File "/opt/conda/envs/qiime2-2019.1/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
```
Error running this command:

```
hsp.py -i EC -t /data/.qiime-tmp/tmpyekwr7gs/placed_seqs.tre -p 8 -o /data/.qiime-tmp/tmpyekwr7gs/picrust2_out/EC_predicted -m mp
```
I’m running picrust2-2.0.3-b with the following commands:

```
$ qiime fragment-insertion sepp
$ qiime picrust2 custom-tree-pipeline
```
Do you have any recommendations for how to work around this error? Thanks for your help!
Reading @jairideout’s references above, it looks like this is something joblib just cannot handle.
@gmdouglas, is there the potential to share a file instead? @wasade does some neat tricks with HDF5 for the unifrac plugin.
@ebolyen, I don’t think this is related to HDF5.
It looks like a dense form of the table is being used via pandas, and a serialization of that is being sent over the socket. Based on the value in the traceback, it looks like >2 GB are being sent per chunk. If the partitioning is by samples, that would likely explain it, as 500 samples x 1,000,000 features x 8 bytes is ~4 GB.
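The arithmetic above can be checked directly against the int32 limit that appears in the traceback; this is just the back-of-envelope math from the post, not anything picrust2-specific:

```python
# Back-of-envelope check: a dense chunk of 500 samples x 1,000,000
# features of float64 values, versus the int32 ('i' format) limit
# from the struct.error in the traceback.
samples, features, bytes_per_value = 500, 1_000_000, 8
chunk_bytes = samples * features * bytes_per_value
int32_max = 2**31 - 1  # 2147483647

print(chunk_bytes)              # 4000000000, i.e. ~4 GB per chunk
print(chunk_bytes > int32_max)  # True: too large for the pickle framing
```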
A possible solution may be to modify the chunk size when partitioning the table. Unfortunately, from what I can see, that parameter isn’t exposed by the plugin. @gmdouglas, what do you think?
Sorry, that is exactly what I had meant: instead of sending data in memory, share a view of the same file (like a memory-mapped array). However, just making the chunk size smaller is probably much easier.
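For illustration, sharing a view of the same file rather than pickling the data could look like this with a plain numpy memory map (a sketch only; picrust2 does not currently do this):

```python
import os
import tempfile

import numpy as np

# "Parent process": write the dense matrix to a file once.
path = os.path.join(tempfile.mkdtemp(), "traits.dat")
arr = np.memmap(path, dtype="float64", mode="w+", shape=(4, 3))
arr[:] = 1.0
arr.flush()

# "Worker process": open a read-only view of the same file. Only the
# path and shape cross the process boundary, not a multi-GB pickle.
view = np.memmap(path, dtype="float64", mode="r", shape=(4, 3))
print(view.sum())  # 12.0
```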
Thanks for looking into this @wasade. Any recommendations you or others have for how to make this more efficient would be appreciated! I’ll have to look at what the unifrac plugin is doing when I get a chance.
Yes, the chunk size parameter can be changed in the standalone version, but not in the plugin version currently. For now, you could either edit the q2-picrust2 recipe to pass that option to hsp.py when it’s run, or run the standalone version. I’ll be sure to add more options to the plugin in the next release!
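A sketch of what the standalone workaround might look like, assuming the installed hsp.py exposes a `--chunk_size` option (check `hsp.py --help` for the exact flag name and default in your version); the other arguments mirror the failing command from the traceback above:

```shell
# Standalone hsp.py run with a reduced chunk size, so each chunk
# serialized by joblib stays well under the 2 GB pickle limit.
# --chunk_size is an assumption here; confirm with `hsp.py --help`.
hsp.py -i EC \
       -t placed_seqs.tre \
       -o EC_predicted \
       -m mp \
       -p 8 \
       --chunk_size 100
```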
The parallelization strategy within Striped UniFrac is rooted in pairwise comparisons and expressed over partitions of the resulting distance matrix – details of the approach are in the supplemental here. I don’t believe it is pertinent, although I apologize if I’m looking at this incorrectly. One strategy I do recommend considering, if possible, is to partition a biom.Table, as it is natively sparse, which would greatly reduce the space requirements.
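To put rough numbers on the sparse-versus-dense difference for the table in this thread, here is a sizing sketch (the 0.1% nonzero density is a hypothetical figure for illustration; real ASV tables vary):

```python
# Dense float64 storage for the full 21,884 samples x 1,019,523 ASVs table.
samples, asvs = 21_884, 1_019_523
dense_bytes = samples * asvs * 8  # ~178 GB

# Hypothetical 0.1% nonzero density; CSR-style sparse storage costs
# roughly 16 bytes per nonzero (value plus indices), so a few hundred MB.
nnz = int(samples * asvs * 0.001)
sparse_bytes = nnz * 16

print(dense_bytes / 1e9)   # ~178.5 (GB)
print(sparse_bytes / 1e9)  # ~0.36 (GB)
```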
Thanks everyone for your ideas on how to work around this error! I’ll try a different chunk size and follow up on how it goes.
@gmdouglas, is there any way to run q2-picrust2 on smaller batches of samples and merge the predicted feature tables? This is related to a previous question I had about inconsistent placement of ASVs shared across samples when I tried this approach. I’m bringing up the question here because it might help with processing large datasets. Is it possible to process samples independently and merge using q2-picrust2’s full-pipeline command instead of q2-fragment-insertion + custom-tree-pipeline?
Thanks for the recommendation! Is the biom GitHub page the best place to ask for help if I run into issues?
I just realized that since this error occurred at the hsp step, the number of samples shouldn’t be relevant, only the number of ASVs. You could run the ASVs in batches, but that would throw off the predicted pathway abundances. To get around this, you would need to run the standalone pathway_pipeline.py script on the final table of predicted EC numbers.
For example, you could run 100,000 ASVs at a time through either the full or the sepp-based q2-picrust2 approach. You would need to sum the predicted gene family abundances per batch, per sample. I’m not sure how easy that would be to do with existing qiime2 commands. You would then need to export the final EC table to BIOM and input it to pathway_pipeline.py to get pathway abundances.
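The per-batch summing step could be sketched with pandas like this (the EC numbers and sample names are made-up toy data standing in for exported per-batch prediction tables):

```python
import pandas as pd

# Hypothetical predicted EC tables from two ASV batches (EC x sample).
batch1 = pd.DataFrame({"S1": [2.0, 1.0], "S2": [0.0, 4.0]},
                      index=["EC:1.1.1.1", "EC:2.7.1.1"])
batch2 = pd.DataFrame({"S1": [3.0], "S2": [1.0]},
                      index=["EC:1.1.1.1"])

# Sum gene-family abundances per sample across batches; fill_value=0
# keeps EC numbers present in only one batch instead of producing NaNs.
merged = batch1.add(batch2, fill_value=0)
print(merged.loc["EC:1.1.1.1", "S1"])  # 5.0
print(merged.loc["EC:2.7.1.1", "S2"])  # 4.0
```

With more than two batches, the same `add(..., fill_value=0)` call can be folded over the list of tables (e.g. with `functools.reduce`) before exporting the merged table to BIOM.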
In general, making ASV placement totally reproducible is still a major issue I need to address, so I would avoid re-placing the same ASV for different samples.
Hopefully that’s clear! I wrote this on my phone as I’ll be away from a computer the next few days.
Yes! Feel free to kick me via email as well if necessary
Thanks for the details on a workaround @gmdouglas, I’ll try out this ASV batching strategy soon and report back on how it goes!