QIIME 2 Artifact Caching

(Evan Bolyen) #1

I was talking to @thermokarst the other day about how we might be able to cache things like database index files, and I wanted to write it down here (for input and so we don’t forget).

What if there were a user-managed artifact cache? This seems like it would avoid a lot of the issues associated with automated caching. Imagine the following scenarios:


I have a large bowtie2 index, and I want to reference it a lot. I can run the following command:

qiime tools send-to-cache --name sra1 my-index.qza

This will take the .qza and extract it into something like:

$XDG_CACHE_HOME/qiime2/b08b8ada-2b64-4548-9d21-e549dd9d0a4d/

and it would update some file:

$XDG_CACHE_HOME/qiime2/cache-manifest.yaml

to set this line:
sra1: b08b8ada-2b64-4548-9d21-e549dd9d0a4d
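
To make the idea concrete, here is a minimal sketch of what a `send-to-cache` step might do internally. All the names here (`send_to_cache`, the cache layout) are assumptions for illustration, not an implemented QIIME 2 API, and a real implementation would use a proper YAML library for the manifest:

```python
# Hypothetical sketch of a send-to-cache operation; names are illustrative.
import os
import zipfile
from pathlib import Path

def send_to_cache(qza_path, name, cache_root=None):
    """Extract a .qza into the cache and record a name -> UUID mapping."""
    cache_root = Path(cache_root or os.environ.get(
        "XDG_CACHE_HOME", Path.home() / ".cache")) / "qiime2"
    cache_root.mkdir(parents=True, exist_ok=True)

    with zipfile.ZipFile(qza_path) as zf:
        # A .qza's single top-level directory is named after the artifact's UUID.
        uuid = zf.namelist()[0].split("/")[0]
        zf.extractall(cache_root)

    # A real implementation would use a YAML library; appending plain
    # `name: uuid` lines keeps this sketch dependency-free.
    manifest = cache_root / "cache-manifest.yaml"
    with open(manifest, "a") as fh:
        fh.write(f"{name}: {uuid}\n")
    return cache_root / uuid
```

The key point is that "sending to the cache" is just an unzip plus a one-line bookkeeping entry.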

I would then be able to use this artifact in any command with some syntax like:

--i-input-data {{ sra1 }}

You could imagine importing directly into the cache as well, skipping the zip file entirely:

qiime tools import-cache \
  --type ... \
  --input-path ... \
  --input-format ... \
  --output-name sra1

What’s kind of neat about this is that our object model doesn’t really need things to be in a zip file at all; they get extracted to a temporary location with the same file structure as the archive. This means we really don’t need much code to support the above: we’d just instantiate an artifact from a different filepath and add some logic to retrieve/add/remove/inspect things in the cache.
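
The retrieval side could be equally small. Here is a hedged sketch of how an interface might resolve the proposed `{{ name }}` syntax against the manifest; the token syntax is just the proposal above, and `load_manifest`/`resolve` are hypothetical helper names:

```python
# Hypothetical sketch of resolving `{{ name }}` cache references; the
# syntax and helper names are the proposal in this post, not a real API.
import re
from pathlib import Path

_TOKEN = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def load_manifest(cache_root):
    """Parse simple `name: uuid` manifest lines into a dict."""
    manifest = Path(cache_root) / "cache-manifest.yaml"
    entries = {}
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        name, _, uuid = line.partition(":")
        entries[name.strip()] = uuid.strip()
    return entries

def resolve(value, cache_root):
    """Replace each `{{ name }}` token with its extracted archive's path."""
    entries = load_manifest(cache_root)
    def repl(match):
        uuid = entries[match.group(1)]
        return str(Path(cache_root) / uuid)
    return _TOKEN.sub(repl, value)
```

An interface would then hand the resolved directory path to whatever loads artifacts from an unpacked archive.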


Questions for the community:

  1. Does this workflow seem reasonable? Is it too much effort to think about and manage a cache?
    (You could also imagine putting classifiers in the cache so that you don’t have to find them on disk)

  2. What specific names make sense here?

  3. What should the syntax be for retrieval in q2cli? Something Bash-safe that doesn’t conflict with filenames would be nice, but that might be asking too much.

  4. Where are these files actually stored? Is $XDG_CACHE_HOME generally configured to be large enough for this, or do we need a real configuration file to set this?

cc @epruesse

(Nicholas Bokulich) #2

Caching sounds pretty useful for artifacts that are used often, such as classifiers or other database artifacts, but are the benefits worth the effort? I also see some concerns with managing such a cache. A few questions/challenges to add to your list of questions:

  1. Is the end goal here to create aliases for easy retrieval across studies (e.g., for classifiers and other commonly used artifacts), or actually to cache important files, e.g., so they don’t get lost?
  2. What other types of artifacts could benefit from this? I could see something like a feature table, which is input to many actions in a single analysis, but that use is short-term and probably not a worthwhile target for caching.
  3. Classifiers often need to be updated, e.g., every time QIIME 2 uses a different release of sklearn, so even then this cached artifact would be temporary. Similarly, databases grow and are updated regularly (with at least one notable exception!), and classifiers (and other database artifacts) need to be retrained.
  4. If this is for CLI users, couldn’t they just set a filepath alias in their bash profile to accomplish the same thing (or at least the retrieval part)? The power users who run QIIME 2 frequently enough with the same classifiers/databases, and hence would find this cache useful, are most likely already doing (or could do) something like this. For most QIIME 2 users, caching will not be as useful.

Managing such a cache also seems like a challenge for most users. So I suppose the bottom line is this: what users/how many would find artifact caching useful?

(Evan Bolyen) #3

Good questions/points!

I’m thinking the main use-case for this is multi-gigabyte databases, where unzipping can take several minutes. Meanwhile, the format within the database is likely optimized for random retrieval, so the same access outside of QIIME 2 would take only a few seconds. I don’t think it would be worth messing with the cache except for these large, optimized reference indices/databases.

This would be an example. More generally, anything related to Bowtie2 or other shotgun metagenomics tools would really benefit.

Yep, that’s definitely true, so the only benefit here is avoiding the zip/unzip behavior, while still letting QIIME 2 keep provenance and all of those other good things.

Not necessarily, as the interface still needs to load the Artifact in a different way, since the cache doesn’t contain .qza files, just plain directory archives. Arguably, this could also be achieved by letting q2cli/other interfaces load artifacts that just aren’t in a .qza, but I like the idea of a cache in that it keeps you from being “in charge” of the archive directories, as they aren’t “protected” in a zip file anymore.

(Nicholas Bokulich) #4

Thanks! The example makes the goal much clearer. This makes much more sense for something like bowtie2 databases, especially if they are frequently/repeatedly accessed. As we may wind up with a few more plugins using bowtie2 databases, this could be a useful feature to have, though in the near term it seems to impact only a small number of users (anyone using SINA or shogun).

(Elmar Pruesse) #5

The user-managed caching concept you outlined sounds awfully complicated to me. I don’t think it’s going to appeal to users much. The point of using a QIIME 2 plugin is to have things be simple.

I had pictured a cache feature in the QIIME 2 SDK where a plugin can request that a .qza input be cached by simply keeping the staging location between qiime invocations rather than deleting it at exit. That would allow tools to create their cache files next to the source data, as they usually do, whether that’s .bt2, .pt, or myindex-type files.
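
That "keep the staging location" idea could be sketched roughly like this, assuming the staging directory is keyed by the artifact's UUID so a second invocation finds the already-extracted data (and any index files a tool wrote next to it). The `stage` name and layout are assumptions, not SDK API:

```python
# Sketch of reusing a staging directory between invocations so tool-made
# index files next to the source data survive. Names are illustrative.
import zipfile
from pathlib import Path

def stage(qza_path, staging_root):
    """Extract a .qza into staging_root/<uuid>, reusing it if present."""
    with zipfile.ZipFile(qza_path) as zf:
        uuid = zf.namelist()[0].split("/")[0]
        target = Path(staging_root) / uuid
        if not target.exists():   # cache hit: skip the slow unzip entirely
            zf.extractall(staging_root)
    return target
```

The second call returns immediately, which is the whole benefit for tools whose startup cost is dominated by extraction and indexing.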

Users are already familiar with that concept and its caveats. If you use conda, you quickly learn to use conda clean, lest your miniconda installation occupy tens of GBs of tarballs and old packages you no longer use. A single log message along the lines of “Cached xyz.qza. Total cache size is now 2.3 GB” would suffice to cue users to find and use a qiime cache clean command should they need to free disk space.

If you want fancy, add a .qiime2rc where people can set the cache location, a maximum cache size, and a maximum cache entry size. A simple LRU approach should make this work transparently for most use cases. If you want to be really fancy, have something that does a disk-vs-time cost analysis, but I don’t think that is necessary at this time. KISS principle.
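
The LRU approach really is small. A minimal eviction sketch, assuming one directory per cache entry and using mtime as a proxy for last use (the function and config names are made up for illustration):

```python
# Minimal LRU-style eviction: drop the least-recently-used entries until
# the cache fits under a configured maximum. Names are illustrative.
import shutil
from pathlib import Path

def dir_size(path):
    """Total size in bytes of all files under path."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())

def evict_lru(cache_root, max_bytes):
    """Remove oldest cache entries (by mtime) until under max_bytes."""
    entries = [p for p in Path(cache_root).iterdir() if p.is_dir()]
    # Oldest first; mtime is a proxy for last use (atime is often unreliable).
    entries.sort(key=lambda p: p.stat().st_mtime)
    total = sum(dir_size(p) for p in entries)
    evicted = []
    for entry in entries:
        if total <= max_bytes:
            break
        total -= dir_size(entry)
        shutil.rmtree(entry)
        evicted.append(entry.name)
    return evicted
```

A `qiime cache clean` command could be little more than this with `max_bytes=0`, or with the limit read from the hypothetical .qiime2rc.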

W.r.t. SINA: the most recent versions already have an internal fast-index method that can index the latest SILVA Ref NR in about 4 minutes on my MacBook and creates a ~400 MB cache file that lets the next invocation launch nearly instantly. (The ARB PT-server-based versions need about 1 hour to index and create some 2 GB of index files.)
