Plugin lifecycle - large files

mroper · March 8, 2023, 12:10am

Hi

I have an additional question regarding a plugin.

The plugin has a multi-file directory object pointing to single-file objects, some of which are very large CSV files that crash pandas. An on-disk solution handles this in both the directory-level and the single-file-level validate methods. At present, however, the CSV file is read in both the directory-level and the file-level validate, which is very slow. Is there some way to maintain the state of the on-disk storage layer between calls to validate (and also in plugin functions) so that the CSV file does not need to be read repeatedly?

Thanks!

gregcaporaso · March 9, 2023, 8:43pm

Hi @mroper,
Are you iterating over the whole file for validation? It's pretty common with large files to only do a small sanity check during validation (e.g., look at the first 10 lines of a large .tsv file). Here's an example I wrote recently. I realize this doesn't answer your question directly, but if you're currently reviewing the whole file for validation this would save a lot of time.
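A minimal sketch of that kind of partial check, using only the Python standard library. The function name `sanity_check_tsv` and its parameters are hypothetical, and how it would be wired into a plugin's validate method is left out; it only illustrates reading the header plus the first few rows and stopping there.

```python
import csv
import itertools


def sanity_check_tsv(path, n_lines=10, expected_columns=None):
    """Check only the header and the first `n_lines` data rows of a TSV file."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        header = next(reader, None)
        if header is None:
            raise ValueError(f"{path} is empty")
        if expected_columns is not None and header != list(expected_columns):
            raise ValueError(f"Unexpected header in {path}: {header}")
        # islice stops after n_lines rows, so the rest of the file is never read.
        for line_no, row in enumerate(itertools.islice(reader, n_lines), start=2):
            if len(row) != len(header):
                raise ValueError(
                    f"Line {line_no} of {path} has {len(row)} fields, "
                    f"expected {len(header)}"
                )
```

The trade-off is that a malformed row beyond the first `n_lines` rows goes unnoticed until the file is actually used.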

mroper · March 12, 2023, 12:04pm

Hi @gregcaporaso,

Thanks for your comment ...

I have cut the checks down to what I consider the bare minimum, but I am still going through the whole file. I consider this necessary.

So, I'm afraid the question still stands.

One thought: the on-disk solution converts the file to HDF5 format for subsequent reading. Is there some way to find out which tmp directory the QIIME 2 objects I'm handling are unzipped to? That way I could just stash the HDF5 files in there as an easy way to maintain state.
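A rough sketch of the stashing idea, under some loud assumptions: it does not rely on knowing where QIIME 2 unzips anything, and instead keeps converted files in a separate cache directory keyed by a hash of the source file's contents, so the same CSV is recognised even if it shows up under a different temporary path. `CACHE_DIR`, `file_digest`, `cached_conversion` and the `convert(src, dst)` callable are all hypothetical names; the actual CSV-to-HDF5 conversion would be supplied by the plugin.

```python
import hashlib
import os
import tempfile

# Hypothetical cache location; nothing here is a QIIME 2 convention.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "my_plugin_cache")


def file_digest(path, block_size=1 << 20):
    """SHA-256 of the file contents, read in blocks to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def cached_conversion(src_path, convert, suffix=".h5", cache_dir=CACHE_DIR):
    """Return the path of the converted copy of `src_path`.

    `convert(src, dst)` is called only when no cached copy exists yet.
    """
    os.makedirs(cache_dir, exist_ok=True)
    dst = os.path.join(cache_dir, file_digest(src_path) + suffix)
    if not os.path.exists(dst):
        tmp = f"{dst}.{os.getpid()}.tmp"
        convert(src_path, tmp)
        # Rename into place so other readers never see a half-written file.
        os.replace(tmp, dst)
    return dst
```

Hashing still reads the whole file once per call, but that is typically much cheaper than parsing the CSV. Nothing in this sketch cleans the cache up, so removing stale files would need to be handled separately.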

Thanks!

mroper · March 12, 2023, 12:51pm

Hello again @gregcaporaso ... what actually is the lifecycle of QIIME 2 objects, from unzipping to destruction?

Thanks

jwdebelius · March 13, 2023, 12:24am

Hi @mroper,

I'm not sure if this is helpful, but I've used the Dask Python library for reading and handling large files. I'm not sure it will work in your exact case, but it has a nice delayed structure that gets speed from parallel processing and can handle large data as long as it can be chunked. It's slower than pandas if you can't run in parallel, but it has worked well for my purposes. Some specific operations are expensive (sorting, groupby), but a lot of things can be done quickly.
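To show what "can be chunked" means in practice, here is a small sketch that uses only the Python standard library rather than Dask's own API. It runs sequentially, so it gets none of the parallel speed-up; the helper names `iter_chunks` and `column_sum` are made up for illustration. The point is that each chunk is processed on its own and only the small partial results are combined.

```python
import csv
import itertools


def iter_chunks(path, chunk_size=100_000):
    """Yield lists of at most `chunk_size` rows (as dicts) from a CSV file."""
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        while True:
            chunk = list(itertools.islice(reader, chunk_size))
            if not chunk:
                return
            yield chunk


def column_sum(path, column, chunk_size=100_000):
    """Sum a numeric column by combining per-chunk partial sums."""
    total = 0.0
    for chunk in iter_chunks(path, chunk_size):
        # Only one chunk is held in memory at a time.
        total += sum(float(row[column]) for row in chunk)
    return total
```

Sums, counts and per-row checks fit this pattern well. Operations that need to see all rows at once, such as a global sort, do not decompose this way, which is consistent with those being the expensive cases.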

Best,
Justine