Provenance Replay Alpha Release and Tutorial

Hello everyone! I'm super excited to announce the alpha release of provenance_lib, software I've been building to support scientific reproducibility, attribution, and collaboration on the QIIME 2 platform.

:tada: :qiime2: :tada: provenance_lib is available in alpha release on the QIIME 2 Library :tada: :qiime2: :tada:

Installation instructions are available on the library page, as is basic usage documentation. For your entertainment and edification, I've included examples of some fun things you can do with it here. Please note that these commands may change in future releases, and this post may not be updated. The key ideas should remain the same, though, and the in-software documentation and library page will be kept current.

A video illustrating the use of this new functionality is available on the QIIME 2 YouTube channel here.

About

provenance_lib parses the computational history (or "provenance") of QIIME 2 Results, letting you:

  • generate executable scripts for your preferred QIIME 2 interface, allowing users to "replay" prior analyses
  • report all citations from a QIIME 2 analysis, so creating your reference section is easy
  • produce reproducibility supplements for publication or collaboration, supporting the reproducibility of your QIIME 2 work

Getting help for the CLI

You can see which commands are available with replay --help.
If there's a command you'd like to know more about, use replay that-command --help, or even replay that-command with no arguments.

It is possible to:

  • replay citations
  • replay provenance
  • replay supplement

In the examples below, we'll use the last of these because it's a catch-all, producing a zipfile with citations and provenance replay results packed inside. The other commands are similar, but produce only component parts of the supplement.

If reading the documentation doesn't solve your problem, please raise user support questions in the Community Plugin Support category of the QIIME 2 Forum. Mentioning me in your post, @ChrisKeefe, will help me respond to your questions quickly.

Please do not raise user support questions as issues on the GitHub repository.
They may be closed without response as off topic.

Use - CLI :badger:

  1. Install provenance_lib using the directions.
  2. Activate the conda environment in which you've installed provenance_lib.
  3. Navigate into a directory with some QIIME 2 Results in it (i.e. .qza and .qzv files).
    If you don't have one handy, you can navigate into the provenance-lib directory you downloaded during installation and find provenance-lib/provenance_lib/tests/data/parse_dir_test/

If you run the following, it will produce a zip archive called reproducibility_supplement.zip

replay supplement \
  --i-in-fp . \
  --o-out-fp ./reproducibility_supplement.zip

Note that parsing many Results can take a long time. Expect ~10 minutes for 500 Results on a decent contemporary laptop.

Supplement outputs :popcorn:

If you run the command above in a directory containing all of the Results from one experiment, it will generate reproducibility documentation you can include as supplemental material alongside the paper. Unzip it to find the following:

  • a directory of metadata .tsv files called recorded_metadata
  • a python3 replay script written to python3_replay.py
  • a cli replay script written to cli_replay.sh
  • a citations bibtex file written to citations.bib
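
To check what landed in the zip without unpacking it, something like the following might do. It's a minimal sketch using only Python's standard zipfile module. The function name summarize_supplement is made up for illustration, and it assumes the files listed above keep those exact names somewhere inside the archive.

```python
import zipfile
from pathlib import PurePosixPath

# Names taken from the list above; where they sit inside the archive is an assumption.
EXPECTED_FILES = ("python3_replay.py", "cli_replay.sh", "citations.bib")
METADATA_DIR = "recorded_metadata"


def summarize_supplement(zip_path):
    """Return (found_files, has_metadata) for a supplement zip."""
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    # Compare on base names so the check works at any nesting depth.
    found = sorted({PurePosixPath(n).name for n in names} & set(EXPECTED_FILES))
    has_metadata = any(METADATA_DIR in PurePosixPath(n).parts for n in names)
    return found, has_metadata
```

Calling summarize_supplement("reproducibility_supplement.zip") returns the expected files it found, plus a flag telling you whether any metadata was packed in.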

Use --p-recurse to include subdirectories :shell:

I usually keep my Results for a QIIME 2 analysis in a folder with lots of subfolders. If you want to include all of those subfolders, just use the --p-recurse flag:

replay supplement \
  --i-in-fp . \
  --p-recurse \
  --o-out-fp ./reproducibility_supplement.zip

Without --p-recurse, you will report on only the Results in the current directory: ./
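
If you'd like to preview how many Results each mode would pick up before committing to a long parse, a quick pathlib sketch can help. This only illustrates the difference between a flat and a recursive search for .qza and .qzv files. It is not how provenance_lib itself discovers Results, and find_results is a hypothetical helper name.

```python
from pathlib import Path

RESULT_SUFFIXES = {".qza", ".qzv"}


def find_results(directory, recurse=False):
    """List .qza/.qzv files in directory, optionally descending into subfolders."""
    root = Path(directory)
    # rglob walks every subdirectory; glob looks only at the top level.
    candidates = root.rglob("*") if recurse else root.glob("*")
    return sorted(
        p for p in candidates if p.is_file() and p.suffix in RESULT_SUFFIXES
    )


if __name__ == "__main__":
    print(len(find_results(".")), "Results in this directory")
    print(len(find_results(".", recurse=True)), "Results including subdirectories")
```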

Generate a comprehensive works cited :nerd_face:

You can import citations.bib into Zotero (or Mendeley, or your fave citation manager), and then generate citations for all of the computational methods you used. This provides the benefit of crediting more of the scientists and developers who build the less charismatic but totally critical underlying software that makes so much of what we do possible.

Some publications prohibit works cited that don't have in-text references. We're working on a supplemental methods manifest that might help alleviate this issue, but at the end of the day, what your publisher says goes.
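
For a quick look at what's in citations.bib, say to compare it against your in-text references, a few lines of Python can list the entry keys. This is a rough sketch built on a simple regular expression rather than a real BibTeX parser, so unusual formatting could trip it up. The function name list_citation_keys is hypothetical.

```python
import re

# Matches the start of an entry such as "@article{key," and captures type and key.
ENTRY_START = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")


def list_citation_keys(bibtex_text):
    """Return (entry_type, key) pairs in order of appearance."""
    return [
        (entry_type.lower(), key)
        for entry_type, key in ENTRY_START.findall(bibtex_text)
    ]


if __name__ == "__main__":
    with open("citations.bib", encoding="utf-8") as fh:
        for entry_type, key in list_citation_keys(fh.read()):
            print(entry_type, key)
```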

Don't share private metadata :construction: :fire: :stop_sign: :zap: :construction:

The directory of metadata files may be removed, edited, or included as-is. That's up to you. It's very important to remember that some metadata cannot generally be published (e.g. if it includes subjects' personal health information). By passing the --p-no-dump-recorded-metadata flag to the command, you can prevent all sample metadata from being written to the supplement. If you fail to use this, you can still delete the metadata directory before sharing.

replay supplement \
  --i-in-fp . \
  --p-no-dump-recorded-metadata \
  --o-out-fp ./reproducibility_supplement.zip
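
If you've already built a supplement and only then realize the metadata shouldn't go out, you don't have to re-run the parse. The sketch below copies a zip while leaving out everything under recorded_metadata. It relies only on Python's standard zipfile module. The helper name copy_without_metadata is made up, and it assumes the metadata lives in a directory with that name somewhere inside the archive.

```python
import zipfile
from pathlib import PurePosixPath


def copy_without_metadata(src_zip, dst_zip, metadata_dir="recorded_metadata"):
    """Copy src_zip to dst_zip, skipping anything under metadata_dir.

    Returns the names of the entries that were left out.
    """
    skipped = []
    with zipfile.ZipFile(src_zip) as src, \
            zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            # Zip entry names always use forward slashes.
            if metadata_dir in PurePosixPath(info.filename).parts:
                skipped.append(info.filename)
                continue
            dst.writestr(info, src.read(info.filename))
    return skipped
```

Check the returned list, and peek inside the new zip, before you share it.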

Include public metadata to simplify corroboration :hugs:

On the other hand, if your study metadata can be published, including it will make it easier for others to corroborate and extend your work. Passing the --p-use-recorded-metadata flag will insert references to the dumped metadata files into your scripts, so that each command uses the original metadata automatically if the script is run.

replay supplement \
  --i-in-fp . \
  --p-use-recorded-metadata \
  --o-out-fp ./reproducibility_supplement.zip
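
If you drive the CLI from a Python script, a small helper can assemble the flags shown so far. Treat this as a sketch: build_supplement_command and its parameter names are hypothetical, and only the flag strings come from this post. Whether every combination of flags is accepted is an assumption here, and the guard against using recorded metadata without dumping it is this sketch's own choice.

```python
import shlex


def build_supplement_command(in_fp, out_fp, recurse=False,
                             dump_recorded_metadata=True,
                             use_recorded_metadata=False):
    """Assemble a `replay supplement` argument list from the flags in this post."""
    if use_recorded_metadata and not dump_recorded_metadata:
        # Sketch-level guard: the scripts would point at files never written.
        raise ValueError("use_recorded_metadata needs the metadata to be dumped")
    cmd = ["replay", "supplement", "--i-in-fp", str(in_fp)]
    if recurse:
        cmd.append("--p-recurse")
    if not dump_recorded_metadata:
        cmd.append("--p-no-dump-recorded-metadata")
    if use_recorded_metadata:
        cmd.append("--p-use-recorded-metadata")
    cmd += ["--o-out-fp", str(out_fp)]
    return cmd


# Print the command for review before running anything.
print(shlex.join(
    build_supplement_command(".", "./reproducibility_supplement.zip", recurse=True)
))
```

To actually execute it, pass the returned list to subprocess.run(cmd, check=True) in an environment where replay is installed.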

Publish your original data if possible :ant:

If you are allowed to publish your original data, doing so will make your work easier to corroborate and extend, adding potential value for later investigators and potentially increasing the likelihood of citation. There are plenty of data repositories out there, and which you choose is entirely up to you.

A common gotcha: it may be worth publishing your data in the same format it was in when you began your actual analysis. A recent study I participated in published data to SRA, which requires a demultiplexed format different from that used in our actual analysis process. In addition to SRA, then, we've included the raw sequence files (in the original format we used for the study) in our reproducibility supplement. This way, users of the supplement can actually re-run the scripts directly, without rewriting the steps where data is imported into QIIME 2 to accommodate the demultiplexed SRA format. By packaging data and scripts together, we can also hard-code relative file paths for the input data into our replay scripts, so users only have to unzip and run.

Best practice: Edit your scripts before sharing :writing_hand:

The CLI and Python 3 scripts included in the supplement will let users of different interfaces read, understand, corroborate, and extend your work in a straightforward way. However, they require a little bit of expert guidance to work effectively.

The scripts are self-documenting, and will tell you what changes need to be made. Here are a few fine points.

  1. Follow all of the instructions in the scripts before sharing. The only thing worse than code that doesn't run is someone else's code that doesn't run. :wink:
  2. Point the script at your input data if possible. Input data is not captured in provenance; it's often too large. I like to include the data in the zip file when possible, so that I can hard-code file paths in the scripts. If that's not feasible for you, include basic instructions for your user on where to find your data, or what type of data you used so that they can bring their own.
  3. If you're not including your metadata in the supplement, you may need to clarify some commands for the user. Specifically, in cases where a command (e.g. longitudinal first-differences) takes both a metadata .tsv and an artifact passed as metadata, the .tsv file will not show up in the command. This issue will probably require some changes in the framework to resolve fully, so keep an eye out for it.

Consider re-running your analysis scripts before sharing :honeybee:

Typos happen, and I've made a few. If the compute costs aren't too high, re-running your analysis from the reproducibility scripts will ensure that your readers/users get working code.

Use - Python API :snake:

The Python API, at its most basic, works just like the CLI. For simplicity, we've exposed the same three one-shot commands described above, with matching parameter sets. The primary difference in basic usage is syntactical.

Please read the CLI description above for general principles. A simple usage example follows:

import provenance_lib

# helptext for the package
help(provenance_lib)

# command-specific helptext
help(provenance_lib.replay_supplement)

# iPython and Jupyter-notebook alternative:
? provenance_lib.replay_supplement

# Generate a reproducibility supplement for the current directory's
# Results including all of its subdirectories recursively
provenance_lib.replay_supplement(
    '.', './reproducibility-supplement.zip', recurse=True)

More power and flexibility are available to users of the Python API, through
interaction with the underlying ProvDAG class. A deeper dive into the Python API is linked from the QIIME 2 Library.

Contributing :doughnut:

Again, please do not raise user support questions as issues on the GitHub repository.
They may be closed without response as off topic.

To report bugs or propose new features or enhancements, please open an issue on GitHub. Contributions will be warmly welcomed.


Please note that I will be traveling through 8/12/22, and may be quite slow to respond during that time. Thanks for your patience!
