Propagating Provenance

jwdebelius · August 25, 2020, 7:11pm

Hi delightful :qiime2: dev team,

I've got a fun new problem!

I have a function that combines a feature table, reference, and rep-set from multiple regions. There is no fixed number of regions, but I need to enforce a strict mapping between the regions in some way so I can retain information about the regional labels and order. ...I ended up solving this with a manifest format.

The problem with the manifest is that I lose the provenance of the artifacts that went into my output files when I loop over them this way. ...Is there a smarter way that (a) I can combine multiple strictly related files into a single function or (b) somehow account for the provenance?

github.com

jwdebelius/q2-sidle/blob/main/q2_sidle/_reconstruct.py

import copy
import itertools as it
import os
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import biom
import dask
from dask.delayed import Delayed
import dask.dataframe as dd
import numpy as np
import pandas as pd

from qiime2 import Metadata, Artifact
from qiime2.plugin import ValidationError
from q2_sidle._utils import (_setup_dask_client, 
                             degen_reps,
                             _check_regions,
                             )
from q2_sidle._build_database import _check_regions

This file has been truncated.

Best,
Justine

jwdebelius · September 1, 2020, 4:02pm

Hey @ebolyen, can I ping you on this? I know you're super busy, and when you've got bandwidth, I'd love help or suggestions.

ebolyen · September 1, 2020, 6:01pm

Hey @jwdebelius!!

I don't think there's a particularly awesome way to handle this right now. From my current perspective, provenance makes the most sense when it is transparent to the action, which leaves mostly just inputs as a way to manage this.

Now, I think there are some good options on that front. For the immediate term, you can declare several List[FeatureWhatever] inputs and verify they have the same length. The list will preserve order and you can use that to identify regions. This is pretty inconvenient, but it will work (you may wish to have a List[Str] parameter which holds the region labels). The obvious downside is that it is on the user to make sure the order is the same, so the relationship is implied rather than explicit. In a tutorial you could make this clear via formatting:

qiime some-action \
   --p-names a \
     --i-foos foo-a.qza \
     --i-bars bar-a.qza \
   --p-names b \
     --i-foos foo-b.qza \
     --i-bars bar-b.qza \
   --p-names c \
     --i-foos foo-c.qza \
     --i-bars bar-c.qza \
   --o-stuff output.qza

Then in the plugin you could do something like:

for foo, bar, name in zip(foos, bars, names):
    # do stuff

But again, this would just be a convention strongly implied by formatting, but not enforced.
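To make the "verify they have the same length" step concrete, here is a minimal sketch in plain Python. The function name pair_regions and the string stand-ins for artifacts are hypothetical, used only for illustration; a real plugin action would receive the already-loaded views of each artifact instead.

```python
from typing import Dict, Sequence, Tuple


def pair_regions(names: Sequence[str],
                 foos: Sequence[object],
                 bars: Sequence[object]) -> Dict[str, Tuple[object, object]]:
    """Pair parallel inputs by position and key them by region label.

    Hypothetical helper: the names and argument layout are assumptions.
    """
    # zip() silently truncates to the shortest input, so check lengths first.
    lengths = {"names": len(names), "foos": len(foos), "bars": len(bars)}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Inputs must have the same length, got {lengths}")
    # Region labels become dict keys, so duplicates would overwrite each other.
    if len(set(names)) != len(names):
        raise ValueError("Region names must be unique")
    # Dicts preserve insertion order, so the regional order is retained.
    return {name: (foo, bar) for name, foo, bar in zip(names, foos, bars)}


# Strings stand in for the artifacts' views here.
regions = pair_regions(["a", "b", "c"],
                       ["foo-a", "foo-b", "foo-c"],
                       ["bar-a", "bar-b", "bar-c"])
```

This only catches mismatched lengths and duplicate labels. It cannot detect a user passing the inputs in a different order, which is the limitation described above.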

Now there was some discussion between myself, @wasade, @thermokarst, and @yoshiki about how to create "record" types, which would more explicitly handle this situation. For instance, suppose there was a type such as:

List[Tuple[Feature[Foo], Sample[Bar], Str]]

or

Mapping[Str, Tuple[Feature[Foo], Sample[Bar]]]

This would create a really obvious structure for the plugin's action to use (a list of tuples or a dict of tuples). What is less obvious is how to create a command line interface for this. There was some discussion of using : or :: as a "field delimiter", but this is all still pretty speculative. Alternatively, perhaps something like a prefs file is necessary. The distinction between this and your manifest would be that a prefs file is still strongly typed from components an interface would understand, and isn't actually what the Artifact side of the action uses. An interface could therefore generate a well-typed Python structure through a GUI (which perhaps only shows artifacts in your workspace) if it wished, whereas the command line would create that same Python structure by parsing a prefs file of its own design. (Ironically, a GUI would be very straightforward for this kind of complex record type; it's the command line that's the challenge!)
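As a purely illustrative sketch of the "field delimiter" idea, the following plain Python turns delimited strings into the dict-of-tuples structure an action might receive. Nothing here is an existing QIIME 2 feature: the parse_records function, the name::foo::bar field order, and the :: syntax are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def parse_records(fields: List[str],
                  delimiter: str = "::") -> Dict[str, Tuple[str, str]]:
    """Parse 'name::foo-path::bar-path' strings into name -> (foo, bar).

    Hypothetical: the field order and delimiter are assumptions.
    """
    records: Dict[str, Tuple[str, str]] = {}
    for field in fields:
        parts = field.split(delimiter)
        if len(parts) != 3:
            raise ValueError(
                f"Expected 3 fields separated by {delimiter!r}: {field!r}")
        name, foo, bar = parts
        # Each record name may appear only once.
        if name in records:
            raise ValueError(f"Duplicate record name: {name!r}")
        records[name] = (foo, bar)
    return records


records = parse_records(["a::foo-a.qza::bar-a.qza",
                         "b::foo-b.qza::bar-b.qza"])
```

In this sketch the relationship between a region's files is explicit in each record, so it no longer depends on the user keeping parallel lists in the same order.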

As an alternative to "inputs", it occurs to me that it would also be easy enough to extend pipeline ctx objects to "use" an artifact. So you might hypothetically create a manifest in an artifact and then "side-load" the artifacts. While provenance could certainly be preserved with some hypothetical method: ctx.use_artifact("path"), I think this is a bad idea, because interfaces can no longer reason about data to be used. This makes creating more managed interfaces infeasible as you would need to know about all actions which side-load artifacts (and from where), and suddenly the interface doesn't get to control where those artifacts live. This is pretty contrary to our goals and couples interfaces to plugins, which is super-not-good. Also what will likely happen almost immediately, if we did implement this, is some plugin will start using a particular path as a "reference database" and side-load that without the interface (and therefore, usually, user) having a say.

jwdebelius · September 1, 2020, 7:15pm

Hi @ebolyen,

Thank you for the really good feedback! I'll experiment with the ordering, and maybe that will be a separate parameter. The issue you raised about things being interface agnostic is an important one, since the manifest approach does assume that data exists in files, which builds on that CLI idea.

I do think a way to handle more complex scenarios through a mapping would be really useful, so I'll stay tuned.

Again, thanks!

Best,
Justine