QIIME 2 Artifact Caching

ebolyen · March 15, 2019, 5:25pm

I was talking to @thermokarst the other day about how we might be able to cache things like database index files, and I wanted to write it down here (for input and so we don't forget).

What if there was a user managed artifact cache? This seems like it would avoid a lot of the issues associated with automated caching. Imagine the following scenarios:

I have a large bowtie2 index, and I want to reference it a lot. I can run the following command:

qiime tools send-to-cache --name sra1 my-index.qza

This will take the .qza and extract it into something like:

$XDG_CACHE_HOME/qiime2/b08b8ada-2b64-4548-9d21-e549dd9d0a4d/

and it would update some file:

$XDG_CACHE_HOME/qiime2/cache-manifest.yaml

to set this line:
sra1: b08b8ada-2b64-4548-9d21-e549dd9d0a4d

I would then be able use this artifact in any command with some syntax like:

--i-input-data {{ sra1 }}

You could imagine importing directly into the cache as well, skipping the zip file entirely:

qiime tools import-cache \
  --type ... \
  --input-path ... \
  --input-format ... \
  --output-name sra1

What's kind of neat about this is our object model doesn't really need things to be in a zip file at all, they get extracted to a temporary location with the same file structure as the archive. This means we really don't need much code at all to support the above, as we'd just instantiate an artifact from a different filepath and add some logic to retrieve/add/remove/inspect things from the cache

Questions for the community:

Does this workflow seem reasonable? Is it too much effort to think about and manage a cache?
(You could also imagine putting classifiers in the cache so that you don't have to find them on disk)
What specific names make sense here?
What should the syntax be for retrieval for q2cli? Something BASH safe that doesn't conflict with files would be nice, but that might be asking too much.
Where are these files actually stored? Is $XDG_CACHE_HOME generally configured to be large enough for this, or do we need a real configuration file to set this?

cc @epruesse

Nicholas_Bokulich · March 15, 2019, 5:56pm

Caching sounds pretty useful for artifacts that are used often, such as classifiers or other database artifacts, but are the benefits worth the effort? I also see some concerns with managing such a cache. A few questions/challenges to add to your list of questions:

is the end goal here to create aliases for easy retrieval across studies (e.g., for classifiers and other commonly used artifacts) or actually for caching important files, e.g., so they don't get lost?
what other types of artifacts could benefit from this? I could see something like a feature table, which is input to many actions in a single analysis, but this is still short-term and probably not a worthwhile target for caching.
Classifiers often need to be updated, e.g., every time :qiime2: uses a different release of sklearn, so even then this cached artifact would be temporary. Similarly, databases grow and are updated regularly (with at least one notable exception!) and classifiers (and other database artifacts) need to be retrained.
If this is for CLI users, they could just set a filepath alias in their bash profile to accomplish the same thing (or at least the retrieval part), no? The sort of power users who use QIIME 2 frequently enough with the same classifiers/databases and hence would find this cache useful are most likely already doing (or could do) something like this. For most QIIME 2 users, caching will not be as useful.

Managing such a cache also seems like a challenge for most users. So I suppose the bottom line is this: what users/how many would find artifact caching useful?

ebolyen · March 15, 2019, 6:41pm

Good questions/points!

I'm thinking the main use-case for this is for databases which are many gigabytes which means unzipping would take several minutes. Meanwhile the format within the database is likely optimized for random retrieval meaning doing this outside of QIIME 2 would have taken only a few seconds for access. I don't think it would be worth messing with the cache except for these large, optimized, reference indices/databases.

This would be an example. More generally, anything related to Bowtie2 or other shotgun metagenomics tools would really benefit.

Yep, that's definitely true, so the only benefit here is avoiding the zip/unzip behavior, while still letting QIIME 2 keep provenance and all of those other good things.

Not necessarily, as the interface still needs to load the Artifact in a different way as the cache doesn't contain .qza files, just plain-directory archives. Arguably, this could also be achieved by letting q2cli/other interfaces load artifacts that just aren't in a .qza, but I like the idea of a cache in that it prevents you from being "in charge" of the archive directories as they aren't "protected" in a zip file anymore.

Nicholas_Bokulich · March 15, 2019, 6:56pm

Thanks! The example makes the goal much more clear. This makes much more sense for something like bowtie2 databases especially if they are frequently/repeatedly accessed. As we may wind up with a few more plugins using bowtie2 databases, this could be a useful feature to have, though in the near term it seems to impact only a small number of users (anyone using SINA or shogun).

epruesse · March 15, 2019, 8:23pm

The user managed caching concept you outlined sounds awfully complicated to me. I don't think that's going to appeal to users much. The point of using a qiime2 plugin is to have things be simple.

I had pictured a cache feature in the Qiime2 SDK where a plugin can request that a QZA input be cached by just keeping the staging location between qiime invocations rather than deleting it at exit. That would allow tools to create their cache files next to the source data as they usually do - whether that's .bt2 or .pt or myindex type files.

Users are already familiar with that concept and its caveats. If you use conda you quickly learn to use conda clean lest your miniconda installation occupy tens of GBs for tarballs and old packages you no longer use. A single log message along the lines of Cached xyz.qza. Total cache size is now 2.3GB would suffice to cue users to find and use a qiime cache clean command should they need to free disk space.

If you want fancy, add a .qiime2rc where people can set the cache location, a maximum cache size, a maximum cache entry size. A simple LRU approach should make this work transparently for most use cases. If you want to be really fancy, have something that does a disk-vs-time cost analysis. But I don't think this is quite necessary at this time. KISS principle.

W.r.t. SINA: The most recent versions already have an internal fast index method that can index the latest SILVA Ref NR in about 4 minutes on my macbook and creates a ~400MB cache file that allows the next iteration to launch nearly instantly. (The ARB PT server based versions need about 1 hour to index and create some 2GB of index files.)

ebolyen · November 9, 2019, 12:02am

Resurrecting this conversation

I think if an interface emitted a message indicating that this had happened that could work well.

One thing that I like about a manually managed cache is the user gets to choose a name for the data, which means that they don't need to keep the .qza around once it has been added to a cache (and named).

I'm imagining this is more useful for shotgun metagenomics where the file sizes can become overwhelming and we really don't want this data in a zip file on any level (except for file transfer maybe).

That said, maybe both can work together. So if a plugin annotates an argument as something that should be cached, QIIME 2 just does it like you suggest, and if it sees the same UUID again, then it gets to skip a lot of disk IO. Additionally, this cache can be explicitly managed, so the user could choose to preempt this system and provide some reference to an element of the cache, letting them delete some QZAs that they always use.

yoshiki · April 21, 2020, 7:03pm

This issue has manifested once again in the context of projects with hundreds of GBs or on the TB scale. These sizes are more common for metagenomics projects and 16S samples run in a NovaSeq instrument. Although as pointed out by @wasade this is also a problem with large distance matrices.

For posterity here's the main answers I heard about when I posted about this on Slack:

@ebolyen suggested:

hypothetical idea: what if you specified an alternative extension (like .q2ref ) which q2cli would interpret as a desire to not save as a zip, and then in the framework we have a configuration for specifying a single fast device where it will place full archive structures (named by UUID, so basically just what it does now) and a working space for transformers so we can use hard-links when possible. Then if your seq data (and other large things as @wasade mentioned) is on the same device, it will hard-link during import so there's no copy, and you end up with a soft-link kind of artifact which just points at the right archive structure.Garbage collection becomes a problem, but perhaps that is left to the user for now?
and then whenever you have a large step where you don't want to deal with the zip file, you can just use the alternative extension

@wasade suggested a sweet filename extension:

qzb for QIIME zip BIG.

@ebolyen further explanied:

... I am imagining the qzb is not a zip at all, rather we have a centralized storage of these artifacts, hence the desire to hard-link, and then the transformer situation is the same either way. Basically the actual way q2 sees an artifact object is as a directory, so it would be super trivial to just not use /tmp/ for that, if we abandon the zip file for this storage mode, then we just instantiate a normal artifact object off of a different location (there's a notion of multiple Archivers, we just have only the ZipArchiver).
so if there was a central store that looked like the archive format already, this becomes pretty easy to implement
the central store also solves the problem of a working directory for transformers, because ideally we don't go back out to /tmp/ for this

@wasade pointed out a possible issue:

one issue here though is there may be many storage systems
looking at this from qiita, it would be wonderful to have an artifact refer to primary data in /projects/qiita_data/…/specific-preparation/ while having where things are written in this big representation under /fancy-storage/${user}/qiime2-things . We’d then need to migrate some stuff, possibly all? from fancy storage back in to /projects

Lastly @ebolyen pointed this out:

I think if we could come up with a storage scheme that minimizes disk transfers in the context of qiita we would by nature have a solution to 1
so the hard bit is making sure transformers do their work in the right place
playing it super naive, a .transform_the_data/ directory next to the real data probably won't work, since read-only directories are pretty common for this
on some level we probably need a sysadmin to configure something

qiyunzhu · July 13, 2020, 5:46pm

Hi all, this is Qiyun Zhu who is working with Rob. I was pointed to here by @antgonza and @yoshiki to add another example where caching or alternative forms of data storage is needed. I have been developing a plugin: q2-woltka, which is dedicated to the analysis of shotgun metagenomic datasets. Following the original QIIME 2 design goal, I let the program work on artifacts instead of raw files. However, the files we are tallking about are huge -- they are Bowtie2 alignment files and they can easily go up to a few hundreds of GB -- one degree of magnitude larger than the Bowtie2 index itself which we are currently using in the Knight Lab. Therefore gzipping/unzipping to/from artifacts is seemingly inefficient. I am bringing up this case and I will keep track on the discussion here!