Hi @taknotts,
Sorry for the delay.
This is a really good question, so I will write a general tutorial on looking at provenance, since it doesn't appear we have any resources on this!
I'll show two ways to get this information using the Moving Pictures tutorial as an example; please adapt the approach to your specific analysis. The general theme is to use UUIDs (and checksums) instead of filenames, since UUIDs are always unique, whereas filenames can be forgotten or overwritten.
Option 1 (using q2view):
Let's suppose we have a taxonomic barplot and we just don't believe what it's saying, so we think something went wrong. I'll use this particular visualization:
https://view.qiime2.org/visualization/?type=html&src=https%3A%2F%2Fdocs.qiime2.org%2F2019.4%2Fdata%2Ftutorials%2Fmoving-pictures%2Ftaxa-bar-plots.qzv
If we look at the provenance tab we'll see something like this:
If we rearrange it to not look ridiculous and then select the "classifier" node, we can see the Universally Unique IDentifier (UUID) in the top right.
I can compare this ID to a classifier I have downloaded from data resources (or perhaps trained myself).
The UUID I'm looking for is this:
ccb8cd3a-ad34-44fc-882c-6428eed203df
Looking at some of the classifiers, I'm pretty sure I was using the GG 515/806 version, so I download that and look at the UUID (using the peek tab):
We see that the UUID matches what we were looking for, so I know for a fact that this barplot was made with this classifier I just downloaded (or perhaps trained).
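If you would rather check the UUID from a script than from the peek tab, here is a minimal sketch. It assumes the usual archive layout, where everything in the ZIP sits under a single top-level directory named after the artifact's UUID; the function name `archive_uuid` and the filename in the usage comment are just placeholders I made up.

```python
import zipfile


def archive_uuid(path):
    """Return the UUID of a QIIME 2 archive (.qza or .qzv).

    Assumption: every entry in the ZIP lives under one root directory,
    and that directory is named after the archive's UUID.
    """
    with zipfile.ZipFile(path) as zf:
        roots = {name.split("/", 1)[0] for name in zf.namelist()}
    if len(roots) != 1:
        raise ValueError(f"expected one root directory, found {sorted(roots)}")
    return roots.pop()


# Usage (hypothetical filename):
# archive_uuid("my-classifier.qza") == "ccb8cd3a-ad34-44fc-882c-6428eed203df"
```

On the command line, `qiime tools peek` reports the same UUID along with the type and data format.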
Maybe I don't have the classifier at all:
Now supposing for whatever reason I don't even have the classifier any more, I can still look at provenance to try to identify the raw files used:
Here we see that the reference_taxonomy used had an "import" step (by clicking on the box instead of the circle we can see the action that produced the circle), and during that step a manifest was recorded of the filenames used, along with the md5sum of each (a hash/fingerprint of the file). That means I could find a file on my system with (ideally, but not necessarily) that filename, and double check that the md5sum matches (on OS X this is done with the md5 command, on Linux with md5sum; it's a common algorithm, so there are other implementations as well).
We can also do the same thing for the sequences used. Using this, we can pinpoint the raw data used for creating the database.
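If you don't have an md5 tool handy, Python's standard library computes the same digest. This is only a sketch: the function names and the filename in the usage comment are placeholders, and the value you compare against is whatever the manifest in the import step shows for your file.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_md5(path, recorded):
    """Compare a file's MD5 against the value copied from provenance."""
    return md5sum(path) == recorded.strip().lower()


# Usage (hypothetical filename; paste the md5sum from the manifest):
# matches_recorded_md5("reference-taxonomy.txt", "<md5sum from provenance>")
```

A match means the file's contents are the ones that were imported, whatever the file happens to be called now.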
Option 2 (reading provenance manually)
This is a good fallback if all else fails, but it's annoying as we need to walk the graph manually, step by step. I'll show how it's done, and then you follow essentially the same steps as above. I have an editor with syntax highlighting so it will look pretty in my screenshots, but it may be just black and white in other editors.
Because QIIME 2 artifacts and visualizations are just ZIP files, I can use my operating system's file browser to look inside (on OS X you do need to extract first).
The important bit is of course the provenance directory, which I've expanded. There is an action sub-directory, which describes the action that made this particular result. Then there is also the artifacts sub-directory, which has all of the ancestral artifacts used.
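The same listing can be done without extracting anything. Here is a rough sketch, assuming the layout described above (a root directory, then provenance/artifacts/, then one directory per ancestor UUID); `ancestor_uuids` is a name I made up for illustration.

```python
import zipfile


def ancestor_uuids(path):
    """List the UUIDs found under provenance/artifacts/ in an archive.

    Assumption: entries look like
    <root-uuid>/provenance/artifacts/<ancestor-uuid>/...
    """
    uuids = set()
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            parts = name.split("/")
            if len(parts) > 3 and parts[1:3] == ["provenance", "artifacts"] and parts[3]:
                uuids.add(parts[3])
    return sorted(uuids)


# Usage (hypothetical filename):
# "ccb8cd3a-ad34-44fc-882c-6428eed203df" in ancestor_uuids("taxa-bar-plots.qzv")
```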
At this point you could actually skip the rest of the work, and notice that there is a directory named ccb8cd3a-ad34-44fc-882c-6428eed203df and so we must have used that classifier for this barplot, but let's see how we could trace backwards to it:
If we look in the file named provenance/action/action.yaml we see some text which looks just like the sidebar in q2view.
There's a lot of information, but we only care about the action section, which follows the execution section. In particular we look at the inputs and see two: table and taxonomy. We need to be looking at the taxonomy, and we note that its UUID is 45444431-ebc8-4dbf-9f68-caab93f5e00e.
Looking back at our archive, we can find a directory provenance/artifacts/45444431-ebc8-4dbf-9f68-caab93f5e00e and expanding that we see:
Which has a directory structure just like our first action subdirectory. At this point we would look inside action.yaml for 45444431-ebc8-4dbf-9f68-caab93f5e00e and see a similar file as before:
Here we see the classifier used, and we can repeat this process until we reach the import steps, which have no further UUIDs to trace and look like this:
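If walking the graph by hand gets tedious, the steps above can be roughly automated. This is a sketch under several assumptions: each input appears in action.yaml as a single list item of the form `-   name: <uuid>` directly under an `inputs:` key, every ancestor has its own action/action.yaml under provenance/artifacts/, and more complex inputs (such as lists of artifacts) are simply skipped. It scans the text line by line instead of using a YAML parser, since these files contain custom tags that a plain parser may reject. The function names are mine, not part of QIIME 2.

```python
import re
import zipfile

UUID_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")


def input_uuids(action_yaml_text):
    """Return {input name: UUID} from the inputs list of an action.yaml."""
    inputs = {}
    in_inputs = False
    for line in action_yaml_text.splitlines():
        stripped = line.strip()
        if stripped == "inputs:":
            in_inputs = True
            continue
        if in_inputs:
            if not stripped.startswith("-"):
                break  # the inputs list has ended
            name, _, value = stripped.lstrip("- ").partition(":")
            match = UUID_RE.search(value)
            if match:
                inputs[name.strip()] = match.group(0)
    return inputs


def trace(path):
    """Map every UUID reachable from the archive to its own inputs."""
    graph = {}
    with zipfile.ZipFile(path) as zf:
        root = zf.namelist()[0].split("/", 1)[0]
        todo = [root]
        while todo:
            uuid = todo.pop()
            if uuid in graph:
                continue
            if uuid == root:
                member = f"{root}/provenance/action/action.yaml"
            else:
                member = f"{root}/provenance/artifacts/{uuid}/action/action.yaml"
            graph[uuid] = input_uuids(zf.read(member).decode("utf-8"))
            todo.extend(graph[uuid].values())
    return graph


# Usage (hypothetical filename):
# for uuid, inputs in trace("taxa-bar-plots.qzv").items():
#     print(uuid, inputs)
```

Entries whose inputs come back empty are the end of the line, which in practice should be the import steps.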
This is a lot more work than using q2view, but you'll notice the data is exactly the same as q2view, because that is how we store it in the zip file!
Hopefully that is helpful.