Is there information in provenance on the classifier file used?

taknotts · May 13, 2019, 10:01pm

I wanted to check the classifier version (file name) I used with a dataset and was trying to look through the provenance to find it but I could not locate it (inside action.yaml) for taxonomy.qza (which I thought would be the right place to look). Any help would be appreciated.

Thanks,

Trina

ebolyen · May 23, 2019, 6:34pm

Hi @taknotts,

Sorry for the delay!

This is a really good question, so I will write a general tutorial on looking at provenance, since it doesn't appear we have any resources on this!

I'll show two ways to get this information using the Moving Pictures tutorial as an example; please adapt the approach to your specific analysis. The general theme is using UUIDs (and checksums) instead of filenames, as UUIDs are always unique, whereas filenames can be forgotten or overwritten.

Option 1 (using q2view):

Let's suppose we have a taxonomic barplot and we just don't believe what it's saying, so we think something went wrong. I'll use this particular visualization:

https://view.qiime2.org/visualization/?type=html&src=https%3A%2F%2Fdocs.qiime2.org%2F2019.4%2Fdata%2Ftutorials%2Fmoving-pictures%2Ftaxa-bar-plots.qzv

If we look at the provenance tab we'll see something like this:

If we rearrange it to not look ridiculous and then select the "classifier" node, we can see the Universally Unique IDentifier (UUID) in the top right.

I can compare this ID to a classifier I have downloaded from data resources (or perhaps trained myself).
The UUID I'm looking for is this:
ccb8cd3a-ad34-44fc-882c-6428eed203df

Looking at some of the classifiers, I'm pretty sure I was using the GG 515/806 version, so I download that and look at the UUID (using the peek tab):

We see that the UUID matches what we were looking for, so I know for a fact that this barplot was made with this classifier I just downloaded (or perhaps trained).

Maybe I don't have the classifier at all:

Now supposing for whatever reason I don't even have the classifier any more, I can still look at provenance to try to identify the raw files used:

Here we see that the reference_taxonomy used had an "import" step (clicking the box instead of the circle shows the action that produced the circle). During that import, a manifest was recorded with the filenames used and the md5sum of each (a hash/fingerprint of the file contents). That means I could find a file on my system with (ideally, but not necessarily) that filename, and double-check that the md5sum matches. On OS X this is done with the md5 command and on Linux with md5sum; MD5 is a common algorithm, so there are other implementations as well.

We can also do the same thing for the sequences used. Using this, we can pinpoint the raw data used for creating the database.
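
A minimal sketch of that checksum comparison in shell, assuming md5sum is available (with a fallback to the macOS md5 command). The function names md5_of and matches_md5 are made up for this example, and the filename and checksum in the usage line are placeholders, not values from the real manifest:

```shell
# Print only the MD5 hex digest of a file.
md5_of() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | awk '{print $1}'   # Linux (coreutils)
  else
    md5 -q "$1"                      # macOS / BSD
  fi
}

# Succeed (exit status 0) only if the file's digest equals the expected one.
matches_md5() {
  [ "$(md5_of "$1")" = "$2" ]
}

# Usage (hypothetical filename; paste the checksum shown in provenance):
#   matches_md5 reference-taxonomy.txt CHECKSUM_FROM_PROVENANCE && echo "same file"
```

A match tells you the file contents are identical to what was imported, regardless of what the file is called today.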

Option 2 (reading provenance manually)

This is a good fallback if all else fails, but it's annoying as we need to walk the graph manually, step by step. I'll show how it's done, and then you follow essentially the same steps as above. I have an editor with syntax highlighting so it will look pretty in my screenshots, but it may be just black and white in other editors.

Because QIIME 2 artifacts and visualizations are just ZIP files, I can use my operating system's file browser to look inside (on OS X you do need to extract first).

The important bit is of course the provenance directory which I've expanded. There is an action sub-directory which is the action which made this particular result. Then there is also the artifacts sub-directory which has all of the ancestral artifacts used.
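
The ancestral artifacts can also be listed from the command line. This is a rough sketch, assuming the Info-ZIP unzip tool (its -Z1 mode prints one archive path per line) and the layout described above, where each ancestor has its own directory under the root UUID's provenance/artifacts directory. The function names are invented for this example:

```shell
# Read archive paths on stdin; print the UUID of every ancestral artifact once.
# Paths look like ROOT_UUID/provenance/artifacts/ANCESTOR_UUID/..., so the
# ancestor UUID is the fourth "/"-separated field.
ancestor_uuids() {
  awk -F/ '$2 == "provenance" && $3 == "artifacts" && $4 != "" { print $4 }' | sort -u
}

# List the ancestors recorded in a .qza or .qzv file.
list_ancestors() {
  unzip -Z1 "$1" | ancestor_uuids
}

# Usage: is a particular classifier UUID among the ancestors?
#   list_ancestors taxa-bar-plots.qzv \
#     | grep -q ccb8cd3a-ad34-44fc-882c-6428eed203df && echo found
```

This only answers "was this artifact somewhere upstream?"; it does not say which step consumed it.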

At this point you could actually skip the rest of the work, and notice that there is a directory named ccb8cd3a-ad34-44fc-882c-6428eed203df and so we must have used that classifier for this barplot, but let's see how we could trace backwards to it:
If we look in the file named provenance/action/action.yaml we see some text which looks just like the sidebar in q2view.

There's a lot of information, but we only care about the action section which follows the execution section. In particular we look at the inputs and see two: table and taxonomy. We need to be looking at the taxonomy, and we note that its UUID is 45444431-ebc8-4dbf-9f68-caab93f5e00e.

Looking back at our archive, we can find a directory provenance/artifacts/45444431-ebc8-4dbf-9f68-caab93f5e00e and expanding that we see:

Which has a directory structure just like our first action subdirectory. At this point we would look inside action.yaml for 45444431-ebc8-4dbf-9f68-caab93f5e00e and see a similar file as before:

Here we see the classifier used, and we can repeat this process until we reach the import steps, which will have no further UUIDs to trace and look like this:

This is a lot more work than using q2view, but you'll notice the data is exactly the same as q2view, because that is how we store it in the zip file!
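
One step of that manual walk can be sketched in shell. This assumes the inputs in action.yaml are written as a list of "name: uuid" entries directly under an inputs: key, as in the files described above; other layouts (for example, inputs that are themselves lists) would need more care. list_inputs is a name invented for this sketch:

```shell
# Read an action.yaml on stdin; print "input-name<TAB>uuid" for each input.
list_inputs() {
  awk '
    /^ *inputs: *$/       { in_inputs = 1; next }   # start of the inputs list
    in_inputs && !/^ *-/  { in_inputs = 0 }         # first non-list line ends it
    in_inputs {
      sub(/^ *- */, "")                             # drop the list marker
      split($0, kv, /: */)                          # split "name: uuid"
      print kv[1] "\t" kv[2]
    }
  '
}

# Usage, relying on unzip wildcard matching for the root UUID directory:
#   unzip -p taxa-bar-plots.qzv '*/provenance/action/action.yaml' | list_inputs
```

Each UUID it prints names a directory under provenance/artifacts, whose own action.yaml can be fed through the same function to go one step further back.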

Hopefully that is helpful.

taknotts · May 31, 2019, 5:47pm

Thanks @ebolyen! This is incredibly helpful as it gave me greater insight on how to use the q2view feature to look at the provenance. I never realized that selecting the box vs node (circle) gave different information- enlightening! (but perhaps it just didn't stick after going through the tutorials)

It seems like the most straightforward way to use this information is to record the uuid info for any classifier that you use or create (which you showed is easy to do by dragging the classifier.qza file into q2view and using the peek tab to copy and paste the uuid to keep in a file somewhere for future reference). With the information that you provided, I am planning to look for the uuids in the taxonomy.qza archive. In my script, I'll simply search for any version of the classifier I have created and write it out to text file using

unzip -p /path/to/taxonomy.qza '/artifacts//action/action.yaml' > /path/to/output/SilvaXXX_classifier_qza.txt

Unfortunately, it writes a text file for every uuid I look for- most are empty, only the one that was used has content. So in my limited script writing experience, I am just finding and deleting files that are empty so I am left with only 1 classifier_qza.txt file which tells me which one I used in my output folder.

find /path/to/output/SilvaXXX_classifier_qza.txt -empty -type f -delete

Is there a more straightforward way to do this?

Thanks,
Trina

ebolyen · May 31, 2019, 11:29pm

Hi @taknotts,

I am afraid I don't have anything simple to offer at this point (an area that could definitely be improved), but you have the right idea here. So I expanded your solution a bit to avoid the junk files:

The first trick is that every artifact is stored in a root directory named after its UUID, which makes extracting a path troublesome. Fortunately there's qiime tools peek, which gives us exactly that UUID. We can chain that with awk to get the UUID itself:

qiime tools peek taxonomy.qza | awk -F ' +' '{print $2; exit}'

Using that we can make a subshell command to fetch the UUID and prepend it to the path, so to get the action.yaml file we could do this:


#                     v--- calculate uuid -------------------------------------------v v------internal path--------v
unzip -p taxonomy.qza $(qiime tools peek taxonomy.qza | awk -F ' +' '{print $2; exit}')/provenance/action/action.yaml

This prints the entire YAML to stdout. To get the classifier UUID, we can use the fact that the right line happens to have a space before the word classifier (other instances have a dash since they're plugin names). Using awk again, we can get the classifier UUID:

unzip -p taxonomy.qza $(qiime tools peek taxonomy.qza | awk -F ' +' '{print $2; exit}')/provenance/action/action.yaml \
  | grep ' classifier' | awk -F ' +' '{print $4}'

This is pretty gross, and I can't promise it will work forever, but hopefully by the time it stops working we'll have a much better way to query ancestry between artifacts, since it's a sensible question to ask (and kind of the entire point of provenance!)

Hopefully you can see that combining this with the first command above for your classifiers should let you relate them, and this should be a reasonable jumping off point for other things.

If you want to get more sophisticated with the yaml parsing, you could use a tool called yq (named after jq using the same syntax). That would let you access fields by record name, which would be much more robust than grepping the "correct" line.

One thing to note, the filename is repeated twice here, once in the unzip, and again inside the subshell for qiime tools peek. If you are scripting this as part of another command (or just running it) you just need to remember to parameterize both of those.
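
One way to parameterize both spots is to wrap the pipeline in shell functions. This is only a sketch of the commands above with the filename passed in as an argument; the function names (peek_uuid, classifier_uuid, classifier_of) and the classifiers/ directory in the usage comment are hypothetical:

```shell
# stdin: output of `qiime tools peek`; prints the UUID from the first line.
peek_uuid() {
  awk -F ' +' '{print $2; exit}'
}

# stdin: text of an action.yaml; prints the UUID on the " classifier" input line.
classifier_uuid() {
  grep ' classifier' | awk -F ' +' '{print $4}'
}

# $1: path to a taxonomy .qza; prints the UUID of the classifier that made it.
classifier_of() {
  unzip -p "$1" "$(qiime tools peek "$1" | peek_uuid)/provenance/action/action.yaml" \
    | classifier_uuid
}

# Usage: find which local classifier file (if any) produced taxonomy.qza.
#   wanted=$(classifier_of taxonomy.qza)
#   for c in classifiers/*.qza; do
#     [ "$(qiime tools peek "$c" | peek_uuid)" = "$wanted" ] && echo "$c"
#   done
```

Because the filename only appears as "$1", there is just one place to change, and no intermediate text files are written.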

Hopefully that's helpful, sorry we don't have better tooling for this.
