Hi @psps!
Re: original question, I usually do this trick to get the "raw" data. Keep in mind that what you get back is a reference to the backing directory, so it can't "outlive" the artifact object (the directory is cleaned up when the artifact leaves scope). Also, it's best not to change anything in that directory if you intend to use the artifact anywhere (you won't hurt the .qza, but the object will be inconsistent).
# this var can't outlive `some_artifact`
dirfmt = some_artifact.view(some_artifact.format)
plain_path = str(dirfmt)
pathlib_path = dirfmt.path # technically a subclass named
# InPath, it works hard to be immutable.
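If you do need the files to stick around after the artifact is gone, one option is to copy the directory somewhere you control first. Here's a minimal sketch using only the standard library. Note that `snapshot_dir` is a made-up helper name (not part of QIIME 2), and the `src` you pass in would be something like `str(dirfmt)` from above:

```python
import shutil
import tempfile
from pathlib import Path


def snapshot_dir(src, dest=None):
    """Copy everything under `src` into a directory that the caller owns.

    `snapshot_dir` is a hypothetical helper, not a QIIME 2 API.
    If `dest` is omitted, the copy lands in a fresh temporary directory;
    the caller is responsible for deleting it when done.
    """
    src = Path(src)
    if dest is None:
        # mkdtemp creates the parent; "data" inside it does not exist yet.
        dest = Path(tempfile.mkdtemp(prefix="artifact-data-")) / "data"
    dest = Path(dest)
    # copytree creates `dest` itself and copies the tree recursively.
    shutil.copytree(src, dest)
    return dest
```

The copy is independent of the artifact, so editing it won't make the in-memory object inconsistent.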
As to your second reply, nice digging!
1. Yes, it is common to have more than one file, although not nearly as common as just one file. For those situations it's a little tricky, since it depends on which format you have. You are kind of in a fuzzy space between the API for end users and the API for plugins, and it's not really awesome for either yet. I'll expand more on this below.
2. Excellent catch! I wonder if this is how we end up with mysterious file-not-found issues once in a blue moon; it certainly looks like it could be the culprit (we use `copy_tree` in a few places, provenance in particular if I recall). I would really like to explore this some more. Do you have a particularly easy way to cause this situation to occur?
3. That would be absolutely amazing, and we would be happy to support you in that endeavor! Our developer docs are in this repo. We have some automatically generated API docs, but right now a real weakness is the lack of docstrings, as we were in a bit of a hurry to reach feature parity with QIIME 1.
To expand on point 1 a little more, I'll outline what all is going on in `Artifact.format.file.pathspec`, since it could be useful to others (like you said!):
- `artifact.format` is a `DirectoryFormat`; in particular, it is the directory format listed in `metadata.yaml` in the archive. This is decided by whatever plugin defined `register_semantic_type_to_format` for the type that was used, at the time of creation.
- `format.file` is a `BoundFile` (from `qiime2.plugin.model.directory_format`) which has a pathspec (regex or plain string). Now, there is a detail here that is in no way apparent: `file` is the default member name for a `BoundFile` in directory formats which are `SingleFileDirectoryFormat`s. These are special ones which always have a single file, so there's no real mystery about their contents. The directory format in this case is really just a wrapper. We use the fact that the file (whatever it is) inside these is named with the property `file` in our transformers, to convert a single file into its corresponding directory format and vice versa. It's not super interesting, it was just a convenient way to implement that.

What this does mean is that not every directory format will have a `file` member. For example, the `CasavaOneEightSingleLanePerSampleDirFmt` has a `FileCollection` named `.sequences`. `BoundFileCollection`s do have an API like `.iter_views()` which works like `.view` but mapped over the collection (these invoke our transformation system). It's not great, but it does exist. The actual problem is that there's no way to identify which `BoundFile[Collection]`s exist unless you introspect the class, so in most circumstances where you don't really care, it's probably easier to just iterate over the directory contents manually.