Name of exported file

Hi. I am using the Python Artifact API. Is there a way to get the file name(s) that an artifact is exported to by qiime2.sdk.Artifact.export_data() from the Artifact object itself? Is there some attribute/property/field/method of the Artifact that returns the filename?

I’ve read through the Artifact class docs at https://dev.qiime2.org/latest/api-reference/sdk/ , but they are a bit sparse. I’ve also tried to look through the source code, but it seems to be split between different repos, and I’m not sure which one contains the parts for a particular semantic type, or whether the filename would be a general property.

Thanks!

Ok, I tried to figure out the answer to my own question! Hopefully, what I found is helpful:

  1. After more digging through the source code, I found that Artifact.format.file.pathspec is the file name - at least for Artifacts that only consist of 1 file (I didn’t have cases where it was more than 1 file - is it possible?). But, I’m not sure this is the preferred, idiomatic way to get it?

  2. I also noticed that qiime2.sdk.Artifact.export_data() uses distutils.dir_util.copy_tree(), and the version of distutils in the qiime2 conda environment seems susceptible to the bug described at https://stackoverflow.com/a/28055993 (I ran across it while trying to figure out this filename stuff, because I was repeatedly saving and removing files).

  3. I’m not sure how contributions work with this project, but I can offer help in writing/improving the API documentation.

Hi @psps!

Re: original question, I usually do this trick to get the “raw” data. Keep in mind that what you get back is a reference to the backing directory, so it can’t “outlive” the artifact object (the directory is cleaned up when the artifact leaves scope). Also, it is best not to change anything in that directory if you intend to use the artifact anywhere (you won’t hurt the .qza, but the object will be inconsistent).

# this variable can't outlive `some_artifact`
dirfmt = some_artifact.view(some_artifact.format)

plain_path = str(dirfmt)
pathlib_path = dirfmt.path   # technically a subclass named InPath,
                             # which works hard to be immutable
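
If all you need is the file names, one option is to walk that directory yourself (the backing directory above, or whatever directory you passed to export_data). A minimal sketch using only pathlib; the helper name list_files is made up for illustration and is not part of the QIIME 2 API:

```python
from pathlib import Path


def list_files(directory):
    """Return the sorted paths of all regular files under `directory`,
    relative to it, as POSIX-style strings."""
    root = Path(directory)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")   # recurse into subdirectories
        if p.is_file()             # skip the directories themselves
    )
```

With the snippet above, this would be called as list_files(str(dirfmt)) while some_artifact is still in scope.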

As to your second reply, nice digging!

  1. Yes, it is common to have more than one file, although not nearly as common as just one file. For those situations it’s a little tricky since it depends on which format you have. You are kind of in a fuzzy space between API for end users, and API for plugins, and it’s not really awesome for either yet. I’ll expand more on this below.

  2. Excellent catch! I wonder if this is how we end up with mysterious file-not-found issues once in a blue moon; it certainly looks like it could be the culprit (we use copy_tree in a few places, provenance in particular if I recall). I would really like to explore this some more. Do you have a particularly easy way to cause this situation to occur?

  3. That would be absolutely amazing, and we would be happy to support you in that endeavor! Our developer docs are in this repo. We have some automatically generated API docs, but right now a real weakness is the lack of docstrings as we were in a bit of a hurry to reach feature parity with QIIME 1.


To expand on point 1 a little more, I’ll outline what all is going on in Artifact.format.file.pathspec since it could be useful to others (like you said!):

  • artifact.format is a DirectoryFormat; specifically, it is the directory format listed in metadata.yaml in the archive. It is decided, at the time the artifact is created, by whichever plugin registered the type that was used via register_semantic_type_to_format.

  • format.file is a BoundFile (from qiime2.plugin.model.directory_format) which has a pathspec (regex or plain string). Now, there is a detail here that is in no way apparent: file is the default member name for a BoundFile on directory formats which are SingleFileDirectoryFormats. These are special ones which always have a single file, so there’s no real mystery about their contents. The directory format in this case is really just a wrapper. In our transformers, we use the fact that the file inside these formats (whatever it is) is always exposed through the property file to convert a single file into its corresponding directory format and vice versa. It’s not super interesting; it was just a convenient way to implement that.

    What this does mean is that not every directory format will have a file member. For example, CasavaOneEightSingleLanePerSampleDirFmt has a FileCollection named .sequences. BoundFileCollections do have an API like .iter_views(), which works like .view but mapped over the collection (these invoke our transformation system). It’s not great, but it does exist. The actual problem is that there’s no way to identify which BoundFile[Collection]s exist unless you introspect the class, so in most circumstances where you don’t really care, it’s probably easier to just iterate over the directory contents manually.
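
To make “introspect the class” a bit more concrete, here is a rough, self-contained sketch of the idea. The File and FileCollection classes and the two Example formats below are hypothetical stand-ins written for this post, not the real qiime2.plugin.model classes, and it is an assumption that the same lookup would work unchanged against real directory formats:

```python
class File:
    """Hypothetical stand-in for a file declaration on a directory format."""

    def __init__(self, pathspec):
        self.pathspec = pathspec


class FileCollection(File):
    """Hypothetical stand-in for a declaration that matches many files."""


class ExampleSingleFileDirFmt:
    # Mirrors the single-file case: one member, conventionally named `file`.
    file = File("data.tsv")


class ExampleMultiFileDirFmt:
    # Mirrors the multi-file case: a collection described by a regex.
    sequences = FileCollection(r".+\.fastq\.gz")


def declared_members(fmt_class, member_type=File):
    """Map attribute name -> pathspec for members declared on a class
    or any of its base classes."""
    found = {}
    # Walk the MRO from base to subclass so subclasses override parents.
    for klass in reversed(fmt_class.__mro__):
        for name, value in vars(klass).items():
            if isinstance(value, member_type):
                found[name] = value.pathspec
    return found


print(declared_members(ExampleSingleFileDirFmt))
print(declared_members(ExampleMultiFileDirFmt))
```

The point is only that the member names (file, sequences, ...) live on the class, so they can be discovered without knowing them in advance; iterating over the directory is still the simpler route.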

Also while I’m thinking of it, you seem to have more involved use-cases for our API than most, could I ask what you are looking to do? There’s plenty that could be improved upon, so we’d be happy to add things where needed, especially if you wanted to help :wink:

Thanks for the explanation and suggestion to look at the raw directory for the filenames @ebolyen. I wanted a general method to get the filenames that would be robust for different plugins and methods, some of which return Results made up of multiple Artifacts. I’m using the API to automate workflows integrating QIIME 2’s analysis with some other tools. For this reason, I’m also interested in this tantalizing Transformation API, and more documentation of the existing plugins/types/methods.

I’ll poke around your docs repo and see if I can contribute something!

For the copy_tree bug: I found it when I ran export_data once, deleted the output directory using shutil (or renamed it using os.rename), and then ran export_data again, at which point I would get file/directory not found errors. This was because I was running in a Jupyter notebook/Python console, so the distutils cache didn’t get cleared as it might when a script exits (?). From that Stack Overflow thread, it appears distutils is mostly meant for packaging and installing Python modules, and shutil.copytree is for “regular” use.
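
Here is a minimal sketch of the same export, delete, export-again sequence using shutil.copytree instead. It assumes Python 3.8 or newer (for dirs_exist_ok); export_tree is a made-up helper and the file name is a placeholder, so this is not QIIME 2 code, just the copy step in isolation:

```python
import shutil
import tempfile
from pathlib import Path


def export_tree(src, dst):
    """Copy directory `src` into `dst` and return the names now in `dst`."""
    # dirs_exist_ok lets the copy merge into an already existing directory.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sorted(p.name for p in Path(dst).iterdir())


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    (src / "feature-table.biom").write_text("placeholder")  # placeholder file
    dst = Path(tmp) / "exported"

    first = export_tree(src, dst)
    shutil.rmtree(dst)               # delete the output directory
    second = export_tree(src, dst)   # exporting again recreates it

print(first == second)
```

Both copies happen in the same Python process, which is the situation where I hit the errors with copy_tree.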
