Accessing provenance from Artifact API

jwdebelius · July 11, 2023, 5:39pm

Hey excellent qiime2 development team,

I'm doing a meta-analysis of several thousand samples from about 25 different studies. If past me were smarter, she would have written an output file with the exact trimming parameters she used for each of those individual studies before combining the data. Alas, she lacked foresight and for that I currently hate her.

Then I remembered that qiime2 has provenance tied into the Artifacts. I'd like to be able to programmatically extract the provenance from a feature table artifact. My preference would be to do it through the python API instead of opening files.

Is it possible to access the provenance file via the Artifact view? And if so, how might I do that?

My computer system is locked up tighter than Fort Knox because my sys admins hate me and are slightly terrified of what might happen if we let the internet in, so I can't access qiime2view to try and dump the provenance that way. I also can't install any additional plugins, due to aforementioned issues.

Is there a way to get access to the provenance, or should I go back, re-process and do things (more) correctly?

Thanks,
Justine

jwdebelius · July 12, 2023, 1:49pm

Update for future users:

The awesome @lizgehret, @gregcaporaso, and @colinvwood pointed out that I can use the provenance_lib library, which is included in 2023.5 and later, to pull out the provenance as a DAG. (H/T @ChrisKeefe for building it, and the Cap lab in general for having coding standards that are relatively easy to follow!)

https://github.com/qiime2/provenance-lib#python-3-api

I wrote code heavily inspired by (aka shamelessly borrowed from) the provenance_lib library.

If my artifact path is artifact_fp, then I extracted the provenance like this:

import copy
import numpy as np
import pandas as pd
import networkx as nx
import provenance_lib as pl

# Reads the provenance into a directed acyclic graph so it can
# exist in python. I don't care about metadata right now, and I
# want to walk the full tree, I think?
dag = pl.ProvDAG(artifact_fp, parse_metadata=False, recurse=True)

The DAG representation is basically a networkx graph that connects the artifacts (nodes) via commands. The edges sort of show you the directionality. Liz, Chris, or someone from the Caporaso lab is probably better equipped to explain than I.
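To make the shape of that graph a bit more concrete, here is a toy sketch in plain Python. It does not use networkx or provenance_lib, and the artifact names and the `parents` mapping are made up for illustration; in a real ProvDAG the nodes are artifact UUIDs.

```python
# Toy provenance graph: each key is an artifact (node) and the value lists
# the artifacts it was computed from. All names here are hypothetical.
parents = {
    "demux": [],
    "table": ["demux"],
    "rep_seqs": ["demux"],
    "taxonomy": ["rep_seqs"],
    "filtered_table": ["table", "taxonomy"],
}


def ancestors(node, parents):
    """Return every upstream artifact of `node` by walking parent links."""
    seen = set()
    stack = list(parents[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("filtered_table", parents)))
```

Walking upstream from the final artifact like this is the same idea as asking the provenance graph for everything that fed into a feature table.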

Once I have the DAG, I get the ordered list of nodes and a list of command IDs and outputs. It's worth noting that the ordered list does not work well when you have several artifacts coming together.

# Shamelessly borrowed from provenance_lib
sorted_nodes = nx.topological_sort(dag.collapsed_view)
actions = pl.replay.group_by_action(dag, sorted_nodes)
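The caveat about several artifacts coming together can be illustrated with a small sketch. It uses the standard-library `graphlib` module as a stand-in for `nx.topological_sort`, and the graph itself is hypothetical. A topological sort only promises that parents come before children; how two independent branches interleave before they merge is not fixed, so the flat list is one valid ordering among several.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: node -> set of nodes it depends on (its parents).
# Two independent studies are processed separately and then merged.
depends_on = {
    "table_a": {"demux_a"},
    "table_b": {"demux_b"},
    "merged": {"table_a", "table_b"},
}

order = list(TopologicalSorter(depends_on).static_order())
position = {node: i for i, node in enumerate(order)}

# The only guarantee: every parent appears before its child.
parents_first = all(
    position[parent] < position[child]
    for child, deps in depends_on.items()
    for parent in deps
)
print(order, parents_first)
```

Because of this, grouping nodes by action (as `group_by_action` does) is likely a safer handle on merged data than reading the sorted list as a single linear history.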

My goal is to get a python provenance for each action that I can interact with, and pull specific parameters, so I (again heavily inspired by provenance_lib) extracted the commands into a python dict.

def _extract_node_parameters(node, outputs_):
    """
    Converts the node parameters into a dictionary because that's what I need
    """
    action = node.action
    description = {
        'command-id': action.action_id,
        'plugin': action.plugin,
        'command': action.action_name,
        # Read from the node that was passed in, not a global variable
        'parameters': action.parameters,
        'inputs': action.inputs,
        # Flip the mapping so it is keyed by output name
        'outputs': {name: uuid for uuid, name in outputs_.items()}
    }
    return description

der_actions = []
for action_id in (std_actions := actions.std_actions):
    # We are replaying actions not nodes, so any associated node works
    some_node_id_from_this_action = next(iter(std_actions[action_id]))
    n_data = dag.get_node_data(some_node_id_from_this_action)
    if n_data.action.action_type == 'import':
        continue
    else:
        command_ = _extract_node_parameters(n_data, std_actions[action_id])
        der_actions.append(command_)

I got a list of actions with a series of descriptions. This is probably similar to what provenance_lib generates, but rather than having to re-parse a text file, it's now in a dictionary format.
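As a rough sketch of why the dictionary form is handy, the snippet below pulls one parameter out of records shaped like the ones built above. The records, UUIDs, and the `find_parameter` helper are all hypothetical and only there for illustration; real values come from your own provenance.

```python
# Hypothetical records with the same keys as the dictionaries built above.
der_actions_example = [
    {
        "command-id": "action-1",
        "plugin": "dada2",
        "command": "denoise_single",
        "parameters": {"trim_left": 0, "trunc_len": 150},
        "inputs": {"demultiplexed_seqs": "uuid-demux"},
        "outputs": {"table": "uuid-table"},
    },
    {
        "command-id": "action-2",
        "plugin": "feature_table",
        "command": "filter_samples",
        "parameters": {"min_frequency": 1000},
        "inputs": {"table": "uuid-table"},
        "outputs": {"filtered_table": "uuid-filtered"},
    },
]


def find_parameter(records, plugin, command, name):
    """Return the value of `name` for each matching plugin/command record."""
    return [
        rec["parameters"][name]
        for rec in records
        if rec["plugin"] == plugin
        and rec["command"] == command
        and name in rec["parameters"]
    ]


trunc_lens = find_parameter(
    der_actions_example, "dada2", "denoise_single", "trunc_len"
)
print(trunc_lens)
```

Returning a list rather than a single value keeps the case where the same command ran more than once visible instead of silently picking one.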

My specific use case in this example was to simplify provenance across multiple batches within a single framework, so I converted my dictionary to a dataframe. (Yay pandas!) And then dereplicated on unique commands.

def command_to_series(cmd_):
    """
    Converts action to a series
    """
    parameter_str = ', '.join([f'{k}={v}' for k, v in cmd_['parameters'].items()])
    input_str = ', '.join([f'{k}={v}' for k, v in cmd_['inputs'].items()])

    # Shallow copy so the original dictionary is left untouched
    cmd2 = copy.copy(cmd_)
    cmd2['parameters'] = parameter_str
    cmd2['inputs'] = input_str

    return pd.Series(cmd2)

actions_df = pd.DataFrame([
    command_to_series(cmd) for cmd in der_actions
])
unique_df = actions_df.copy().drop_duplicates(['plugin', 'command', 'parameters'])

if unique_df.duplicated(['plugin', 'command']).any():
    raise ValueError('The same command has been executed with different parameters. '
                     'Go to an earlier version of the data')

I also extracted the individual parameters into a long-form series, so I can pull the specific parameters I remember changing and put them into a metadata-style table for future me.
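That last step is not shown above, so here is a minimal sketch of one way it could look. It is written in plain Python with made-up study names and parameter values, and it is not the exact code I ran. It expands each command into one row per parameter, then collects the tracked parameters into one entry per study.

```python
def long_form_rows(study, records):
    """Yield one row per (command, parameter) pair for a single study."""
    for rec in records:
        for name, value in rec["parameters"].items():
            yield {
                "study": study,
                "plugin": rec["plugin"],
                "command": rec["command"],
                "parameter": name,
                "value": value,
            }


# Hypothetical per-study records; real ones come from the provenance.
studies = {
    "study_a": [{"plugin": "dada2", "command": "denoise_single",
                 "parameters": {"trim_left": 0, "trunc_len": 150}}],
    "study_b": [{"plugin": "dada2", "command": "denoise_single",
                 "parameters": {"trim_left": 10, "trunc_len": 120}}],
}

rows = [
    row
    for study, records in studies.items()
    for row in long_form_rows(study, records)
]

# Keep only the parameters worth tracking, one entry per study.
tracked = ("trim_left", "trunc_len")
summary = {}
for row in rows:
    if row["parameter"] in tracked:
        summary.setdefault(row["study"], {})[row["parameter"]] = row["value"]

print(summary)
```

The `rows` list can be handed to `pd.DataFrame(rows)` if you would rather filter and pivot in pandas.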

lizgehret · July 12, 2023, 4:26pm

This is a fantastic summary, thanks so much for this @jwdebelius!

gregcaporaso · July 13, 2023, 4:59pm

Agreed! Thanks so much for sharing this @jwdebelius! Glad you were able to make this work!
