Can QIIME2 Deblur handle 3 terabytes of data?

sthansen · August 7, 2019, 7:00pm

I have a very large dataset that we will be running deblur on tomorrow. There are ~30,000 files, totaling ~3 terabytes of data. If I run QIIME2 deblur, will it duplicate the data on disk? Is there a limit within QIIME2 on how much data this command can handle? And finally, if it is too much, will the command fail immediately, or will it make an attempt and fail when it runs out of memory or space?

Our alternative solution is to install and run deblur directly rather than via QIIME2, and we didn't want to waste a lot of time discovering it wasn't going to work if possible. Any insight is appreciated!

ebolyen · August 7, 2019, 9:11pm

Hi @sthansen,

This is a really great question, for which I don't have an empirical answer, but here are some theoretical challenges you may face:

The zip file format itself. We use zip64, which should support 3 TB, but I've never made one so large, so I'm uncertain how well it will actually work.
The fact that we put the data in a zip file (this is really the problem, I think). You will end up with a 3 TB zip file, which we then extract almost immediately, meaning you need at least 9 TB just to perform the single operation (not including storing the dereplicated output data). This is precisely a reason to have either "cached" or "reference-only" artifacts which point at a location on disk, or to use a FUSE driver. Neither of these is currently implemented; they are kind of "someday, someone will have an impossible amount of data and we'll cross that bridge then" situations.
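
The disk arithmetic above can be sketched in Python as a quick pre-flight check. This is only an illustration, not something QIIME 2 provides: the factor of three (raw files, zipped artifact, extracted copy) is one reading of the 9 TB estimate, and `estimated_peak_bytes` and `has_enough_space` are hypothetical helper names.

```python
import shutil

TB = 10**12  # decimal terabyte, in bytes


def estimated_peak_bytes(raw_bytes, copies=3):
    """Rough peak disk need, assuming `copies` full-size copies of the data.

    copies=3 is an assumption: raw files + zipped artifact + extracted copy.
    Output data from denoising is not included.
    """
    return raw_bytes * copies


def has_enough_space(path, raw_bytes, copies=3):
    """Return True if the filesystem holding `path` has enough free space."""
    free = shutil.disk_usage(path).free
    return free >= estimated_peak_bytes(raw_bytes, copies)


if __name__ == "__main__":
    needed = estimated_peak_bytes(3 * TB)
    print(f"Estimated peak disk need: {needed / TB:.0f} TB")
    print("Enough free space in current directory:", has_enough_space(".", 3 * TB))
```

Checking this before starting would not prevent a failure, but it avoids discovering the shortfall hours into the run.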

Re: failure, I expect QIIME 2 to go as far as the system resources permit, and then fail, so don't expect anything too pleasant in this situation.

In terms of getting work done, running Deblur directly is your best option. In terms of "I would really like to know what happens", please consider trying both if you have the disk space available.

sthansen · August 8, 2019, 6:10pm

Since we are also curious what will happen, we're going to run just plain deblur so we can get the result, and then run deblur via QIIME2. This way, if it runs out of space, it won't crash both processes. I'll try to post an update on what happens in the next week or so.

(One year later) We originally just installed and ran Deblur, but we recently had to move the data and rerun Deblur, so I did it via QIIME2 this time around, as we had more processing space than we did previously. It ran without a hitch!

Mehrbod_Estaki · August 8, 2019, 7:58pm

In case there are memory issues, you can always split your samples into multiple files, run them separately, and merge the results afterwards (as long as the trim/truncate parameters are identical). This is indeed one of the key design features of deblur: it makes it easy to combine multiple samples/studies processed at different times.
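
A minimal sketch of the batching idea in Python, assuming the samples can be listed as plain IDs. `split_into_batches` is a hypothetical helper, not part of QIIME 2 or Deblur, and the sample names and batch size are made up for illustration. Each batch would still need to be denoised with identical trim/truncate parameters before the results are merged.

```python
def split_into_batches(sample_ids, batch_size):
    """Split a list of sample IDs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        sample_ids[start:start + batch_size]
        for start in range(0, len(sample_ids), batch_size)
    ]


if __name__ == "__main__":
    # Hypothetical sample IDs standing in for the ~30,000 files in the question.
    samples = [f"sample_{i:05d}" for i in range(30000)]
    batches = split_into_batches(samples, 5000)
    for number, batch in enumerate(batches, start=1):
        print(f"batch {number}: {len(batch)} samples, first = {batch[0]}")
```

Smaller batches lower the peak disk and memory needed per run, at the cost of more runs to track and merge.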
