Q2-picrust2 hsp.py error with large dataset

jairideout · April 16, 2019, 9:51pm

I’m using q2-picrust2’s custom-tree-pipeline workflow as described in the q2-picrust2 tutorial to process a large ASV feature table (21,884 samples x 1,019,523 ASVs). I'm getting the following error, which seems to be a limitation of joblib/multiprocessing modules if the data being serialized is too big (references: [1], [2], [3]):

Traceback (most recent call last):
 File "/opt/conda/envs/qiime2-2019.1/bin/hsp.py", line 7, in <module>
   exec(compile(f.read(), __file__, 'exec'))
 File "/root/lib/picrust2-2.0.3-b/scripts/hsp.py", line 149, in <module>
   main()
 File "/root/lib/picrust2-2.0.3-b/scripts/hsp.py", line 134, in main
   ran_seed=args.seed)
 File "/root/lib/picrust2-2.0.3-b/picrust2/wrap_hsp.py", line 65, in castor_hsp_workflow
   for trait_in in file_subsets)
 File "/opt/conda/envs/qiime2-2019.1/lib/python3.6/site-packages/joblib/parallel.py", line 934, in __call__
   self.retrieve()
 File "/opt/conda/envs/qiime2-2019.1/lib/python3.6/site-packages/joblib/parallel.py", line 833, in retrieve
   self._output.extend(job.get(timeout=self.timeout))
 File "/opt/conda/envs/qiime2-2019.1/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 521, in wrap_future_result
   return future.result(timeout=timeout)
 File "/opt/conda/envs/qiime2-2019.1/lib/python3.6/concurrent/futures/_base.py", line 432, in result
   return self.__get_result()
 File "/opt/conda/envs/qiime2-2019.1/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
   raise self._exception
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
Error running this command:
hsp.py -i EC -t /data/.qiime-tmp/tmpyekwr7gs/placed_seqs.tre -p 8 -o /data/.qiime-tmp/tmpyekwr7gs/picrust2_out/EC_predicted -m mp

I'm using qiime2-2019.1 and picrust2-2.0.3-b with the following commands:

$ qiime fragment-insertion sepp
    --i-representative-sequences rep_seqs.qza
    --i-reference-alignment reference.fna.qza
    --i-reference-phylogeny reference.tre.qza
    --p-threads 8
    --o-tree sepp_placement/insertion_tree.qza
    --o-placements sepp_placement/insertion_placements.qza

$ qiime picrust2 custom-tree-pipeline
    --i-table feature_table.qza
    --i-tree sepp_placement/insertion_tree.qza
    --p-threads 8
    --p-hsp-method mp
    --p-max-nsti 2
    --output-dir predicted_metagenome

Do you have any recommendations for how to work around this error? Thanks for your help!

ebolyen · April 17, 2019, 4:53pm

cc @gmdouglas

Reading @jairideout's references above, it looks like this is something joblib just cannot handle.

@gmdouglas, is there the potential to share a file instead? @wasade does some neat tricks with HDF5 for unifrac parallelism.

wasade · April 17, 2019, 5:33pm

@ebolyen, I don't think this is related to HDF5.

It looks like a dense form of the table is being used via pandas, and a serialization of that is being sent over the socket. Based on the value in the traceback, it looks like >2GB are being sent per chunk. If the partitioning is by samples, then likely explains it as 500 samples x 1000000 features x 8 bytes is ~4GB

A possible solution may be to modify the chunk size when partitioning the table. I don't believe that parameter is exposed from the plugin unfortunately from what I can see -- @gmdouglas, what do you think?

ebolyen · April 17, 2019, 8:34pm

Sorry that is exactly what I had meant, instead of sending data in memory, share a view to the same file (like unifrac does).

However, just making the chunksize smaller is probably much easier.

gmdouglas · April 18, 2019, 5:27am

Thanks for looking into this @wasade. Any recommendations you or others have for how to make this more efficient would be appreciated! I'll have to look at what the unifrac plugin is doing when I get a chance.

Yes the chunk size parameter can be changed in the standalone version, but not the plugin version currently. For now you could either edit the q2-picrust2 recipe to pass that option to hsp.py when it's being run or run the standalone version. I'll be sure to add more options to the plugin for the next release!

wasade · April 18, 2019, 3:41pm

The parallelization strategy within Striped UniFrac is rooted in pairwise comparisons and expressed over partitions of the resulting distance matrix -- details of the approach are in the supplemental here. I don't believe it is pertinent, although I apologize if I'm looking at this incorrectly. One strategy I do recommend considering, if possible, is to partition a biom.Table as it is natively sparse, and would greatly reduce the space requirements.

jairideout · April 19, 2019, 10:08pm

Thanks everyone for your ideas on how to work around this error! I'll try a different chunk size and follow up on how it goes.

@gmdouglas, is there any way to run q2-picrust2 on smaller batches of samples and merge the predicted feature tables? This is related to a previous question I had about inconsistent placement of ASVs shared across samples when I tried this approach. I'm bringing up the question here because it might help with processing large datasets. Is it possible to process samples independently and merge using q2-picrust's full-pipeline command instead of q2-fragment-insertion + custom-tree-pipeline?

Thanks!

gmdouglas · April 20, 2019, 6:20am

Thanks for the recommendation! Is the biom GitHub page the best place to ask for help if I run into issues?

gmdouglas · April 20, 2019, 6:35am

I just realized that since this error occurred at the hsp step that the number of samples shouldn't be relevant - only the number of ASVs. You could run the ASVs in batches, but that would throw off the predicted pathway abundances. To get around this you would need to run the standalone pathway_pipeline.py script on the final table of predicted EC numbers.

For example, you could run 100,000 ASVs at a time through either of the full or sepp-based q2-picrust2 approaches. You would need to sum the predicted gene family abundances per batch / per sample. I'm not sure how easy that would be to do with existing qiime2 commands. You would then need to export the final EC table to BIOM and input it to pathway_pipeline.py to get pathway abundances.

In general, making ASV placement totally reproducible is still a major issue I need to address, so I would avoid re-placing the same ASV for different samples.

Hopefully that's clear! I wrote this on my phone as I'll be away from a computer the next few days.

wasade · April 20, 2019, 9:12pm

Yes! Feel free to kick me via email as well if necessary

Best,
Daniel

jairideout · April 23, 2019, 5:28pm

Thanks for the details on a workaround @gmdouglas, I'll try out this ASV batching strategy soon and report back on how it goes!