Is there information in provenance on the classifier file used?

(Trina Knotts) #1

I wanted to check which classifier version (file name) I used with a dataset. I looked through the provenance for taxonomy.qza (inside action.yaml, which I thought would be the right place to look), but I could not locate it. Any help would be appreciated.



(Matthew Ryan Dillon) assigned ebolyen #2
(Nicholas Bokulich) unassigned ebolyen #3
(Nicholas Bokulich) assigned ebolyen #4
(Evan Bolyen) #5

Hi @taknotts,

Sorry for the delay :frowning:

This is a really good question, so I will write a general tutorial on looking at provenance, since it doesn’t appear we have any resources on this!

I’ll show two ways to get this information, using the Moving Pictures tutorial as an example; please adapt the approach to your specific analysis. The general theme is to use UUIDs (and checksums) instead of filenames, since UUIDs are always unique, whereas filenames can be forgotten or overwritten.

Option 1 (using q2view):

Let’s suppose we have a taxonomic barplot and we just don’t believe what it’s saying, so we think something went wrong. I’ll use this particular visualization:

If we look at the provenance tab we’ll see something like this:

If we rearrange it to not look ridiculous and then select the “classifier” node, we can see the Universally Unique IDentifier (UUID) in the top right.

I can compare this ID to a classifier I have downloaded from data resources (or perhaps trained myself).
The UUID I’m looking for is this:

Looking at some of the classifiers, I’m pretty sure I was using the GG 515/806 version, so I download that and look at the UUID (using the peek tab):

We see that the UUID matches what we were looking for, so I know for a fact that this barplot was made with this classifier I just downloaded (or perhaps trained).
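If you would rather not upload the file to q2view, the same comparison can be sketched in Python. This is only a minimal sketch, and it relies on an assumption that is not shown in the screenshots above: that a .qza/.qzv archive contains exactly one top-level directory and that this directory is named after the artifact’s UUID. The function names are hypothetical helpers, not part of QIIME 2.

```python
import zipfile


def archive_uuid(path):
    """Return the name of the single top-level directory in the archive.

    Assumption: QIIME 2 names that directory after the artifact's UUID.
    """
    with zipfile.ZipFile(path) as archive:
        roots = {name.split("/", 1)[0] for name in archive.namelist()}
    if len(roots) != 1:
        raise ValueError(f"expected one top-level directory, found {sorted(roots)}")
    return roots.pop()


def is_same_artifact(path, expected_uuid):
    """Check a downloaded file against the UUID seen in provenance."""
    return archive_uuid(path) == expected_uuid.strip().lower()
```

For example, `is_same_artifact("classifier.qza", "ccb8cd3a-ad34-44fc-882c-6428eed203df")` would be the check for the classifier traced in Option 2 (the filename here is just a placeholder).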

Maybe I don’t have the classifier at all:

Now supposing for whatever reason I don’t even have the classifier any more, I can still look at provenance to try to identify the raw files used:

Here we see that the reference_taxonomy had an “import” step (clicking on the box instead of the circle shows the action that produced the circle). During that import, a manifest was recorded with the filenames used and their md5sums (an md5sum is a hash, or fingerprint, of a file). That means I could find a file on my system with (ideally, but not necessarily) that filename, and double check that its md5sum matches. On OS X this is done with the md5 command and on Linux with md5sum; it’s a common algorithm, so there are other implementations as well.
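As one of those other implementations, here is a minimal Python sketch of the same check using the standard library’s hashlib. The function names and the example filename are hypothetical; the expected checksum would be copied by hand from the manifest shown in provenance.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large reference files fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_manifest(path, expected_md5):
    """Compare a local file against an md5sum recorded in provenance."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Something like `matches_manifest("reference_taxonomy.txt", "<md5sum from the manifest>")` then tells you whether a local file is byte-for-byte the one that was imported, regardless of what it is called now.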

We can also do the same thing for the sequences used. Using this, we can pinpoint the raw data used for creating the database.

Option 2 (reading provenance manually)

This is a good fallback if all else fails, but it’s annoying as we need to walk the graph manually, step by step. I’ll show how it’s done, and then you can follow essentially the same steps as above. I have an editor with syntax highlighting so it will look pretty in my screenshots, but it may be just black and white in other editors.

Because QIIME 2 artifacts and visualizations are just ZIP files, I can use my operating system’s file browser to look inside (on OS X you do need to extract first).

The important bit is of course the provenance directory, which I’ve expanded. There is an action sub-directory, which describes the action that made this particular result. There is also an artifacts sub-directory, which has an entry for each of the ancestral artifacts used.
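If a file browser is inconvenient, the layout described above can also be listed from Python without extracting anything. This is a rough sketch under the assumption that member paths look like `<root>/provenance/artifacts/<uuid>/...`, where `<root>` is the archive’s single top-level directory; `ancestor_uuids` is a hypothetical helper name.

```python
import zipfile


def ancestor_uuids(path):
    """List the sub-directory names found under provenance/artifacts."""
    found = set()
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            parts = name.split("/")
            # Assumed layout: <root>/provenance/artifacts/<uuid>/...
            if len(parts) >= 4 and parts[1:3] == ["provenance", "artifacts"] and parts[3]:
                found.add(parts[3])
    return sorted(found)
```

Calling `ancestor_uuids("taxa-bar-plots.qzv")` (placeholder filename) should print one UUID per ancestral artifact, which can then be compared against the UUID of a classifier you have on disk.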

At this point you could actually skip the rest of the work, and notice that there is a directory named ccb8cd3a-ad34-44fc-882c-6428eed203df and so we must have used that classifier for this barplot, but let’s see how we could trace backwards to it:
If we look in the file named provenance/action/action.yaml we see some text which looks just like the sidebar in q2view.

There’s a lot of information, but we only care about the action section, which follows the execution section. In particular we look at the inputs and see two: table and taxonomy. The one we want is taxonomy, and we note that its UUID is 45444431-ebc8-4dbf-9f68-caab93f5e00e.

Looking back at our archive, we can find a directory provenance/artifacts/45444431-ebc8-4dbf-9f68-caab93f5e00e and expanding that we see:


This has a directory structure just like our first action subdirectory. At this point we would look inside action.yaml for 45444431-ebc8-4dbf-9f68-caab93f5e00e and see a similar file as before:

Here we see the classifier used, and we can repeat this process until we reach the import steps, which have no further UUIDs to trace and look like this:


This is a lot more work than using q2view, but you’ll notice the data is exactly the same as what q2view shows, because that is how we store it in the zip file!
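The manual walk can also be scripted. The sketch below is not an official tool, and it makes several assumptions beyond what is shown above: that each simple input appears in action.yaml as a list entry of the form `- name: <uuid>` inside the top-level action section, that ancestors live at `provenance/artifacts/<uuid>/action/action.yaml`, and that the archive has a single top-level directory. It uses a regular expression rather than a YAML parser, so list-valued inputs and anything more exotic are simply not handled. All function names are hypothetical.

```python
import re
import zipfile

UUID_PATTERN = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
# Matches simple entries such as "-   taxonomy: 45444431-...".
INPUT_LINE = re.compile(
    r"^\s*-\s+([\w-]+):\s*(" + UUID_PATTERN + r")\s*$", re.MULTILINE
)


def direct_inputs(action_yaml_text):
    """Return (input_name, uuid) pairs found after the top-level 'action:' key."""
    _, _, action_section = action_yaml_text.partition("\naction:")
    return INPUT_LINE.findall(action_section)


def action_yaml_path(root_uuid, uuid):
    """Locate action.yaml for the result itself or for one of its ancestors."""
    if uuid == root_uuid:
        return f"{root_uuid}/provenance/action/action.yaml"
    return f"{root_uuid}/provenance/artifacts/{uuid}/action/action.yaml"


def trace_inputs(path):
    """Walk provenance and return (child_uuid, input_name, parent_uuid) edges."""
    edges = []
    with zipfile.ZipFile(path) as archive:
        root_uuid = archive.namelist()[0].split("/", 1)[0]
        seen, pending = set(), [root_uuid]
        while pending:
            uuid = pending.pop()
            if uuid in seen:
                continue
            seen.add(uuid)
            text = archive.read(action_yaml_path(root_uuid, uuid)).decode("utf-8")
            for name, parent in direct_inputs(text):
                edges.append((uuid, name, parent))
                pending.append(parent)
    return edges
```

Filtering the result for `input_name == "classifier"` would give the classifier UUID directly; nodes that never appear as a child in any edge are the import steps where the trace ends.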

Hopefully that is helpful.

(Evan Bolyen) unassigned ebolyen #6