Update for future users:
The awesome @lizgehret, @gregcaporaso, and @colinvwood pointed out that I can use the provenance_lib
library, which is included in 2023.5 and later, to pull out the provenance as a DAG. (H/T @ChrisKeefe for building it, and the Caporaso lab in general for having coding standards that are relatively easy to follow!)
https://github.com/qiime2/provenance-lib#python-3-api
I wrote code heavily inspired by (aka shamelessly borrowed from) the provenance_lib library.
If my artifact path is artifact_fp, then I extracted the provenance like this:
import copy

import numpy as np
import pandas as pd
import networkx as nx
import provenance_lib as pl

# Reads the provenance into a directed acyclic graph so it can
# exist in python. I don't care about metadata right now, and I
# want to walk the full tree, I think?
dag = pl.ProvDAG(artifact_fp, parse_metadata=False, recurse=True)
The DAG representation is basically a networkx graph that connects the artifacts (nodes) via commands, and the edges sort of show you the directionality. Liz, Chris, or someone from the Caporaso lab is probably better equipped to explain this than I am.
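If it helps, you can poke at the collapsed view with the usual networkx API. Something like the sketch below (which only uses calls that also appear later in this post) should print which plugin and action produced each node; node IDs are the artifact UUIDs.

# Quick look at what produced each node, skipping imports,
# which don't have a plugin/action in the same way.
for node_id in dag.collapsed_view.nodes:
    n_data = dag.get_node_data(node_id)
    if n_data.action.action_type != 'import':
        print(node_id, n_data.action.plugin, n_data.action.action_name)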
Once I had the DAG, I got the ordered list of nodes and a list of command IDs and outputs. It's worth noting the ordered-list approach does not work well when you have several artifacts coming together.
# Shamelessly borrowed from provenance_lib
sorted_nodes = nx.topological_sort(dag.collapsed_view)
actions = pl.replay.group_by_action(dag, sorted_nodes)
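As far as I can tell, std_actions is keyed by action ID, and each value maps the node UUIDs produced by that action to their output names; that's the structure the code below relies on. You can peek at it like this:

# Peek at the grouping: {action_id: {node_uuid: output_name}}
# (structure inferred from how std_actions is used below)
for action_id, node_outputs in actions.std_actions.items():
    print(action_id, list(node_outputs.values()))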
My goal is to get a Python provenance record for each action that I can interact with and pull specific parameters from, so I (again heavily inspired by provenance_lib) extracted the commands into a Python dict.
def _extract_node_parameters(node, outputs_):
    """
    Converts the node parameters into a dictionary because that's what I need
    """
    action = node.action
    description = {
        'command-id': action.action_id,
        'plugin': action.plugin,
        'command': action.action_name,
        'parameters': action.parameters,
        'inputs': action.inputs,
        # flip {node_uuid: output_name} into {output_name: node_uuid}
        'outputs': {k: v for v, k in outputs_.items()},
    }
    return description
der_actions = []
for action_id in (std_actions := actions.std_actions):
    # We are replaying actions, not nodes, so any associated node works
    some_node_id_from_this_action = next(iter(std_actions[action_id]))
    n_data = dag.get_node_data(some_node_id_from_this_action)
    if n_data.action.action_type == 'import':
        continue
    else:
        command_ = _extract_node_parameters(n_data, std_actions[action_id])
        der_actions.append(command_)
This gave me a list of actions, each with its own description. It's probably similar to what provenance_lib generates, but rather than having to re-parse a text file, it's now in dictionary format.
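If you want to see what one of these entries looks like for your own data, you can pretty-print it; the default=str is just there in case some values aren't JSON-serializable.

import json
# Pretty-print the first extracted action; the keys come from
# _extract_node_parameters above.
print(json.dumps(der_actions[0], indent=2, default=str))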
My specific use case in this example was to simplify provenance across multiple batches within a single framework, so I converted my dictionaries to a dataframe (yay, pandas!) and then dereplicated on unique commands.
def command_to_series(cmd_):
    """
    Converts an action to a series
    """
    parameter_str = ', '.join([f'{k}={v}' for k, v in cmd_['parameters'].items()])
    input_str = ', '.join([f'{k}={v}' for k, v in cmd_['inputs'].items()])
    cmd2 = copy.copy(cmd_)
    cmd2['parameters'] = parameter_str
    cmd2['inputs'] = input_str
    return pd.Series(cmd2)

actions_df = pd.DataFrame([
    command_to_series(cmd) for cmd in der_actions
])
unique_df = actions_df.copy().drop_duplicates(['plugin', 'command', 'parameters'])

if unique_df.duplicated(['plugin', 'command']).any():
    raise ValueError('The same command has been executed with different '
                     'parameters. Go to an earlier version of the data.')
I also extracted the individual parameters into a long-form series that I can use to pull the specific parameters I remember changing and put them into a metadata-style table for future me.
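Roughly, the sketch below is what I mean by that. It assumes der_actions from above (where parameters are still dicts); the column names are just my own choice.

# Sketch: one row per (plugin, command, parameter) so specific
# parameters can be looked up later.
param_records = []
for cmd_ in der_actions:
    for param, value in cmd_['parameters'].items():
        param_records.append({
            'plugin': cmd_['plugin'],
            'command': cmd_['command'],
            'parameter': param,
            'value': value,
        })
long_params = pd.DataFrame(param_records).set_index(
    ['plugin', 'command', 'parameter'])['value']
# long_params is now a long-form series indexed by plugin/command/parameter;
# filter it down to the parameters you remember changing.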