I’m using q2-picrust2’s custom-tree-pipeline workflow as described in the q2-picrust2 tutorial to process a large ASV feature table (21,884 samples x 1,019,523 ASVs). I’m getting the following error, which seems to be a limitation of joblib/multiprocessing modules if the data being serialized is too big (references: [1], [2], [3]):
Traceback (most recent call last):
File "/opt/conda/envs/qiime2-2019.1/bin/hsp.py", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/root/lib/picrust2-2.0.3-b/scripts/hsp.py", line 149, in <module>
main()
File "/root/lib/picrust2-2.0.3-b/scripts/hsp.py", line 134, in main
ran_seed=args.seed)
File "/root/lib/picrust2-2.0.3-b/picrust2/wrap_hsp.py", line 65, in castor_hsp_workflow
for trait_in in file_subsets)
File "/opt/conda/envs/qiime2-2019.1/lib/python3.6/site-packages/joblib/parallel.py", line 934, in __call__
self.retrieve()
File "/opt/conda/envs/qiime2-2019.1/lib/python3.6/site-packages/joblib/parallel.py", line 833, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/opt/conda/envs/qiime2-2019.1/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 521, in wrap_future_result
return future.result(timeout=timeout)
File "/opt/conda/envs/qiime2-2019.1/lib/python3.6/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/opt/conda/envs/qiime2-2019.1/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
Error running this command:
hsp.py -i EC -t /data/.qiime-tmp/tmpyekwr7gs/placed_seqs.tre -p 8 -o /data/.qiime-tmp/tmpyekwr7gs/picrust2_out/EC_predicted -m mp
I’m using qiime2-2019.1 and picrust2-2.0.3-b with the following commands:
It looks like a dense form of the table is being used via pandas, and a serialization of that is being sent over the socket. Based on the value in the traceback, it looks like >2GB are being sent per chunk. If the partitioning is by samples, then likely explains it as 500 samples x 1000000 features x 8 bytes is ~4GB
A possible solution may be to modify the chunk size when partitioning the table. I don’t believe that parameter is exposed from the plugin unfortunately from what I can see – @gmdouglas, what do you think?
Thanks for looking into this @wasade. Any recommendations you or others have for how to make this more efficient would be appreciated! I’ll have to look at what the unifrac plugin is doing when I get a chance.
Yes the chunk size parameter can be changed in the standalone version, but not the plugin version currently. For now you could either edit the q2-picrust2 recipe to pass that option to hsp.py when it’s being run or run the standalone version. I’ll be sure to add more options to the plugin for the next release!
The parallelization strategy within Striped UniFrac is rooted in pairwise comparisons and expressed over partitions of the resulting distance matrix – details of the approach are in the supplemental here. I don’t believe it is pertinent, although I apologize if I’m looking at this incorrectly. One strategy I do recommend considering, if possible, is to partition a biom.Table as it is natively sparse, and would greatly reduce the space requirements.
Thanks everyone for your ideas on how to work around this error! I’ll try a different chunk size and follow up on how it goes.
@gmdouglas, is there any way to run q2-picrust2 on smaller batches of samples and merge the predicted feature tables? This is related to a previous question I had about inconsistent placement of ASVs shared across samples when I tried this approach. I’m bringing up the question here because it might help with processing large datasets. Is it possible to process samples independently and merge using q2-picrust’s full-pipeline command instead of q2-fragment-insertion + custom-tree-pipeline?
I just realized that since this error occurred at the hsp step that the number of samples shouldn’t be relevant - only the number of ASVs. You could run the ASVs in batches, but that would throw off the predicted pathway abundances. To get around this you would need to run the standalone pathway_pipeline.py script on the final table of predicted EC numbers.
For example, you could run 100,000 ASVs at a time through either of the full or sepp-based q2-picrust2 approaches. You would need to sum the predicted gene family abundances per batch / per sample. I’m not sure how easy that would be to do with existing qiime2 commands. You would then need to export the final EC table to BIOM and input it to pathway_pipeline.py to get pathway abundances.
In general, making ASV placement totally reproducible is still a major issue I need to address, so I would avoid re-placing the same ASV for different samples.
Hopefully that’s clear! I wrote this on my phone as I’ll be away from a computer the next few days.