Provenance Replay Beta Release and Tutorial

ChrisKeefe · June 7, 2022, 10:54pm

Hello everyone! I'm super excited to announce the ~~alpha release~~ (update: beta release) of provenance-lib, software I've been building to support scientific reproducibility, attribution, and collaboration on the QIIME 2 platform.

~~provenance_lib is available in alpha release on the QIIME 2 Library~~

Update: As of QIIME 2 2023.9, provenance-lib has been integrated into QIIME 2.

For your entertainment and edification, I've included examples of some fun things you can do with it here. Please note that these commands may change in future releases, and this post may not be updated. The key ideas should remain the same, though, and the in-software documentation and library page will be kept current.

A video illustrating the use of this new functionality is available on the QIIME 2 YouTube channel here.

About

provenance-lib parses the computational history (or "provenance") of QIIME 2 Results, letting you:

generate executable scripts for your preferred QIIME 2 interface, allowing users to "replay" prior analyses
report all citations from a QIIME 2 analysis, so creating your reference section is easy
produce reproducibility supplements for publication or collaboration, supporting the reproducibility of your QIIME 2 work

Getting help for the CLI

Three commands are available:

replay-provenance
replay-citations
replay-supplement

Each of these is available under the qiime tools plugin. If there's a command you'd like to know more about, use qiime tools <your-replay-command> --help.

In the examples below, we'll use replay-supplement because it's a catch-all, producing a zipfile with citations and provenance replay results packed inside. The other commands are similar, but produce only component parts of the supplement.

If reading the documentation doesn't solve your problem, please raise user support questions on the QIIME 2 Forum.

Please do not raise user support questions as issues on Github.
They may be closed without response as off topic.

Use - CLI

Activate your 2023.9 or later QIIME 2 conda environment.
Navigate into a directory with some QIIME 2 Results in it (i.e. .qza and .qzv files). If you don't have any of these, you can download some from the QIIME 2 Moving Pictures tutorial.

If you run the following, it will produce a zip archive called reproducibility_supplement.zip

qiime tools replay-supplement \
  --in-fp . \
  --out-fp ./reproducibility_supplement.zip

Note that parsing many Results or Results with lengthy provenances can take a long time. Expect ~10 minutes for 500 Results on a decent contemporary laptop.

Supplement outputs

If you run the command above in a directory containing all of the Results from one experiment, it will generate reproducibility documentation you can include as supplemental material alongside the paper. Unzip it to find the following:

a directory of metadata .tsv files called recorded_metadata
a python3 replay script written to python3_replay.py
a cli replay script written to cli_replay.sh
a citations bibtex file written to citations.bib
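
If you'd like to sanity-check a supplement before sharing it, a short Python sketch along these lines could list the archive's contents and flag any of the four items above that seem to be missing. It uses only the standard-library `zipfile` module. `list_supplement`, `missing_items`, and `EXPECTED` are hypothetical names made up for this example, and since the archive's exact internal layout isn't spelled out in this post, it matches on path components rather than assuming full paths.

```python
import zipfile

# Items this post says the supplement contains (names taken from the list above).
EXPECTED = ("recorded_metadata", "python3_replay.py", "cli_replay.sh", "citations.bib")


def list_supplement(zip_path):
    """Return the sorted member names of a supplement archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())


def missing_items(zip_path, expected=EXPECTED):
    """Return the expected items that no member path contains as a component."""
    names = list_supplement(zip_path)
    return [
        item
        for item in expected
        if not any(item in name.split("/") for name in names)
    ]
```

Calling `missing_items("reproducibility_supplement.zip")` should return an empty list for a complete supplement. If you passed `--no-dump-recorded-metadata`, expect `recorded_metadata` to be reported as missing.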

Use --recurse to include subdirectories

I usually keep my Results for a QIIME 2 analysis in a folder with lots of subfolders. If you want to include all of those subfolders, just use the --recurse flag:

qiime tools replay-supplement \
  --in-fp . \
  --recurse \
  --out-fp ./reproducibility_supplement.zip

Without --recurse, you will report on only the Results in the current directory: ./
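
To preview the difference before committing to a long parse, you could count the Results yourself. The sketch below is only an approximation of what the command will pick up: it treats any file ending in .qza or .qzv as a Result, which is an assumption, and `find_results` and `estimate_minutes` are hypothetical helper names. The time estimate scales the rough "~10 minutes for 500 Results" figure linearly, which is also an assumption rather than a benchmark.

```python
from pathlib import Path

RESULT_SUFFIXES = {".qza", ".qzv"}


def find_results(root=".", recurse=False):
    """List .qza/.qzv files under root, optionally descending into subfolders."""
    root = Path(root)
    candidates = root.rglob("*") if recurse else root.iterdir()
    return sorted(
        p for p in candidates if p.is_file() and p.suffix in RESULT_SUFFIXES
    )


def estimate_minutes(n_results, minutes_per_500=10.0):
    """Linearly scale the post's rough timing figure (an assumption)."""
    return n_results * minutes_per_500 / 500
```

Comparing `len(find_results("."))` with `len(find_results(".", recurse=True))` shows how many extra Results `--recurse` would likely add, and passing either count to `estimate_minutes` gives a very rough idea of the wait.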

Generate a comprehensive works cited

You can import citations.bib into Zotero (or Mendeley, or your fave citation manager), and then generate citations for all of the computational methods you used. This provides the benefit of crediting more of the scientists and developers who build the less charismatic but totally critical underlying software that makes so much of what we do possible.

Some publications prohibit works cited that don't have in-text references. We're working on a supplemental methods manifest that might help alleviate this issue, but at the end of the day, what your publisher says goes.
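
If you want a quick look at what's in citations.bib, for example to check that nothing was dropped on import or to cross-check entries against your in-text references, a rough sketch like this could list the entry types and keys. It relies on a simple regular expression rather than a real BibTeX parser, so treat it as a convenience for eyeballing the file. `bibtex_keys` and `ENTRY_RE` are hypothetical names for this example.

```python
import re

# Matches the start of an entry such as "@article{some-key," and captures
# the entry type and the citation key. This is a rough pattern, not a parser.
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")


def bibtex_keys(text):
    """Return (entry_type, key) pairs in the order they appear in the text."""
    return [(m.group(1).lower(), m.group(2)) for m in ENTRY_RE.finditer(text)]
```

For instance, `len(bibtex_keys(open("citations.bib").read()))` gives a count you can compare against what your citation manager imported.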

Don't share private metadata

The directory of metadata files may be removed, edited, or included as-is. That's up to you. It's very important to remember that some metadata generally cannot be published (e.g. if it includes subjects' personal health information). By passing the --no-dump-recorded-metadata flag to the command, you can prevent all sample metadata from being written to the supplement. If you forget to pass this flag, you can still delete the metadata directory before sharing.

qiime tools replay-supplement \
  --in-fp . \
  --no-dump-recorded-metadata \
  --out-fp ./reproducibility_supplement.zip
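
If you've already generated a supplement with the metadata included, one way to remove it without unzipping by hand is to copy the archive and skip the metadata entries. This is a minimal sketch using the standard-library `zipfile` module. `strip_metadata` is a hypothetical helper, and it assumes the metadata lives under a directory named `recorded_metadata` somewhere inside the archive. It only drops those files: it does not inspect the replay scripts, so scripts generated with --use-recorded-metadata would still reference the files you removed.

```python
import zipfile


def strip_metadata(src_zip, dst_zip, metadata_dir="recorded_metadata"):
    """Copy src_zip to dst_zip, skipping every member inside metadata_dir.

    Returns the names of the members that were left out.
    """
    skipped = []
    with zipfile.ZipFile(src_zip) as src, \
            zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            # Zip member names always use "/" as the separator.
            if metadata_dir in info.filename.split("/"):
                skipped.append(info.filename)
                continue
            dst.writestr(info, src.read(info.filename))
    return skipped
```

Review the returned list before sharing the new archive, so you can confirm that everything you meant to withhold was actually skipped.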

Include public metadata to simplify corroboration

On the other hand, if your study metadata can be published, including it will make it easier for others to corroborate and extend your work. Passing the --use-recorded-metadata flag will insert references to the dumped metadata files into your scripts, so that each command uses the original metadata automatically if the script is run.

qiime tools replay-supplement \
  --in-fp . \
  --use-recorded-metadata \
  --out-fp ./reproducibility_supplement.zip

Publish your original data if possible

If you are allowed to publish your original data, doing so will make your work easier to corroborate and extend, adding potential value for later investigators and potentially increasing the likelihood of citation. There are plenty of data repositories out there, and which you choose is entirely up to you.

A common gotcha - it may be worth publishing your data in the same format it was in when you began your actual analysis. A recent study I participated in published data to SRA, which requires a demultiplexed format different from that used in our actual analysis process. In addition to SRA, then, we've included the raw sequence files (in the original format we used for the study) in our reproducibility supplement. This way, users of the supplement can actually re-run the scripts directly, without rewriting the steps where data is imported into QIIME 2 to accommodate the demultiplexed SRA format. By packaging data and scripts together, we can also hard-code relative file paths for the input data into our replay scripts - users only have to unzip and run.

Best practice: Edit your scripts before sharing

The CLI and Python 3 scripts included in the supplement will let users of different interfaces read, understand, corroborate, and extend your work in a straightforward way. However, they require a little bit of expert guidance to work effectively.

The scripts are self-documenting, and will tell you what changes need to be made. Here are a few fine points.

Follow all of the instructions in the scripts before sharing. The only thing worse than code that doesn't run is someone else's code that doesn't run.
Point the script at your input data if possible. Input data is not captured in provenance; it's often too large. I like to include the data in the zip file when possible, so that I can hard-code file paths in the scripts. If that's not feasible for you, include basic instructions for your user on where to find your data, or on what type of data you used so that they can bring their own.
If you're not including your metadata in the supplement, you may need to clarify some commands for the user. Specifically, when a command (e.g. longitudinal first-differences) takes both a metadata .tsv and an artifact passed as metadata, the .tsv file will not show up in the command. This issue will probably require some changes in the framework to resolve fully, so keep an eye out for it.

Consider re-running your analysis scripts before sharing

Typos happen, and I've made a few. If the compute costs aren't too high, re-running your analysis from the reproducibility scripts will ensure that your readers/users get working code.
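
When a full re-run is too expensive, a much cheaper partial check is to confirm that the edited Python script at least still parses. The sketch below uses the standard-library `ast` module. `syntax_error_in` is a hypothetical helper name, and the check only catches syntax errors: it says nothing about wrong file paths or parameters, so it complements a real re-run rather than replacing one.

```python
import ast
from pathlib import Path


def syntax_error_in(script_path):
    """Return None if the script parses, else a short error description."""
    source = Path(script_path).read_text(encoding="utf-8")
    try:
        ast.parse(source, filename=str(script_path))
    except SyntaxError as err:
        return f"{err.filename}:{err.lineno}: {err.msg}"
    return None
```

For example, `syntax_error_in("python3_replay.py")` should return `None` if your edits didn't break the script's syntax.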

Use - Python API

The Python API, at its most basic, works just like the CLI. For simplicity, we've exposed the same three one-shot commands described above, with matching parameter sets. The primary differences are syntax and manual choice of qiime2 Usage Drivers. To render CLI output, the q2cli package must be available in your environment.

Please read the CLI description above for general principles. A simple usage example follows:

from qiime2.core.archive import provenance_lib
from qiime2.core.archive.provenance_lib import ReplayPythonUsage
from q2cli.core.usage import ReplayCLIUsage  # if available

# command-specific helptext
help(provenance_lib.replay_supplement)

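# IPython and Jupyter notebook alternative (IPython-only syntax, so it is
# shown as a comment here to keep this block valid in plain Python):
# ?provenance_lib.replay_supplement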

# Generate a reproducibility supplement for the current directory's
# Results including all of its subdirectories recursively
provenance_lib.replay_supplement(
    usage_drivers=[ReplayPythonUsage, ReplayCLIUsage],
    payload='.',
    out_fp='reproducibility-supplement.zip',
    recurse=True
)
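
Since q2cli may not be installed everywhere, you might prefer to build the `usage_drivers` list from whatever is importable instead of importing both drivers unconditionally. The following is a sketch under that assumption. The module paths and class names are the ones used in the example above, while `load_usage_drivers` is a hypothetical helper made up for illustration.

```python
import importlib

# (module path, class name) pairs, taken from the imports in the example above.
CANDIDATE_DRIVERS = [
    ("qiime2.core.archive.provenance_lib", "ReplayPythonUsage"),
    ("q2cli.core.usage", "ReplayCLIUsage"),
]


def load_usage_drivers(candidates=CANDIDATE_DRIVERS):
    """Return the usage driver classes that can be imported in this environment."""
    drivers = []
    for module_name, class_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # package not installed here, skip this driver
        driver = getattr(module, class_name, None)
        if driver is not None:
            drivers.append(driver)
    return drivers
```

The result could then be passed as `usage_drivers=load_usage_drivers()`. In an environment without q2cli, you'd expect only the Python driver to be returned.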

More power and flexibility are available to users of the Python API, through
interaction with the underlying ProvDAG class.

Contributing

Again, please do not raise user support questions as issues on Github.
They may be closed without response as off topic.

To report bugs or propose new features or enhancements, please open an issue on Github. Contributions will be warmly welcomed.

ChrisKeefe · June 7, 2022, 10:55pm

Please note that I will be traveling through 8/12/22, and may be quite slow to respond during that time. Thanks for your patience!