New User UX for viewing Artifacts and Visualizations

kastman · April 10, 2018, 10:45pm

As a new user to Q2, I'm a bit confused by how to actually get "into" artifacts and visualizations. I even wrote a PR to the docs that was totally wrong, expecting qiime tools view to work for both artifacts and visualizations. Shifting discussion from that PR, I wanted to ask a few questions:

Is there any reason for peek and view to be separate tool commands, at least from a UX perspective? These are just pickled objects, right, so deciding which processor should be easy. Is there a reason not to combine the commands, given that you'd alias one to avoid breaking? I'd imagine wanting to view both artifact and vis.
I want to get a view of the relevant raw components of an artifact (e.g. the first few header / [qual] / sequences of a fasta/fastq, or the first few rows of a feature table). Essentially cat sample.fastq | head. However, it seems like peek (and even the view.qiime2.org links from the tutorials) provide metadata, uuid, and provenance, but don't actually tell me what the data is. Is there yet another viewer to do this? Is it something people aren't doing right now? Especially since each semantic type is a first class citizen, this seems like a really necessary step for adoption.

Anyway, it looks like QIIME2 is a huge step forward and very promising. Hope to continue with it, and hope that these nit-picky points end up helping. Look forward to hearing people's thoughts,

ebolyen · April 13, 2018, 5:23pm

Hi @kastman!

Thanks for the great topic!

q2view definitely blurs the line between them, so from a user standpoint the answer is probably "no". However, there is a mechanical difference. Peeking into an artifact or visualization does not require extracting anything. Instead, peek seeks to an offset in the zip file and reads only the metadata.yaml file, which is faster than extracting the entire archive, as view has to do.

In that context, peek becomes useful for allowing other interfaces (like q2studio) or even users to perform type-checking ahead of time without needing to extract every artifact it may be looking at.
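To make the "peek without extracting" idea concrete, here is a minimal sketch using only Python's standard zipfile module. It is not QIIME 2's implementation. The archive layout (a single top-level directory holding metadata.yaml), the demo UUID, and the field values are assumptions made for illustration, and the naive key/value parsing stands in for a real YAML parser.

```python
import io
import zipfile


def peek_metadata(archive):
    """Read only the top-level metadata.yaml member; nothing is extracted to disk."""
    with zipfile.ZipFile(archive) as zf:
        # Assumed layout: <top-level-dir>/metadata.yaml (exactly one slash).
        name = next(
            n for n in zf.namelist()
            if n.count("/") == 1 and n.endswith("/metadata.yaml")
        )
        text = zf.read(name).decode("utf-8")
    # Naive "key: value" parsing; a real reader would use a YAML library.
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def build_demo_archive():
    """Build a small in-memory zip that imitates the assumed layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(
            "1234-demo/metadata.yaml",
            "uuid: 1234-demo\ntype: FeatureTable[Frequency]\nformat: DemoDirFmt\n",
        )
        # A larger data member that peek never has to decompress.
        zf.writestr("1234-demo/data/table.bin", b"\x00" * 10000)
    buf.seek(0)
    return buf


info = peek_metadata(build_demo_archive())
print(info["uuid"], info["type"], info["format"])
```

Only the one small member is decompressed, which is why this kind of lookup stays cheap even when the data payload is large.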

They are technically zip files with a specific directory layout.

There is a qiime tools export command which gets you halfway there. The tricky part is there isn't necessarily a single file to look at (say with head). For example, SampleData[SequencesWithQuality] is a directory of fastq.gz files with the Illumina naming convention and a manifest (for future proofing). The manifest is likely a good target for some kind of "head" command in this example, but how would the computer know that?

Right now, there isn't a generic way to tell the framework: "When you see this format (from metadata.yaml) you can summarize it with this code". It's something we've definitely talked about, but since you can also just export the data, it hasn't ever really become a critical priority for us.

I would expect that kind of registration to be done on the semantic type itself, and it could invoke the same transformer system that methods/visualizers use.
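As a rough illustration of what such a registration could look like, here is a sketch of a registry keyed by format name. Everything in it is hypothetical: `register_summarizer`, `summarize`, the `DemoManifestDirFmt` name, and the manifest columns are invented for this example and are not part of the QIIME 2 framework, which has no such mechanism today.

```python
import os
import tempfile

# Hypothetical registry: format name -> function that previews a data directory.
SUMMARIZERS = {}


def register_summarizer(format_name):
    """Decorator that records a summarizer for the given format name."""
    def decorator(func):
        SUMMARIZERS[format_name] = func
        return func
    return decorator


@register_summarizer("DemoManifestDirFmt")
def head_of_manifest(data_dir, n=3):
    """Return the first n lines of the MANIFEST file in data_dir."""
    with open(os.path.join(data_dir, "MANIFEST")) as fh:
        return [line.rstrip("\n") for _, line in zip(range(n), fh)]


def summarize(format_name, data_dir):
    """Dispatch to whichever summarizer was registered for this format."""
    if format_name not in SUMMARIZERS:
        raise ValueError(f"no summarizer registered for {format_name!r}")
    return SUMMARIZERS[format_name](data_dir)


# Demo: a throwaway directory with a made-up manifest.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "MANIFEST"), "w") as fh:
        fh.write("sample-id,filename,direction\n")
        fh.write("s1,s1_R1.fastq.gz,forward\n")
        fh.write("s2,s2_R1.fastq.gz,forward\n")
        fh.write("s3,s3_R1.fastq.gz,forward\n")
    preview = summarize("DemoManifestDirFmt", tmp)

print(preview)
```

The point of the design is that the knowledge of "which file is worth showing" lives next to the format definition, so any interface can call one generic entry point.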

kastman · April 22, 2018, 8:45pm

Sorry for the delay here, and thanks for the explanation of the current state of affairs. That directory structure looks well thought out, and probably not over-engineered. It took me a second to realize it was recursive, but made sense as soon as I saw that.

I'm totally on board with quick type-checking at an offset, but perhaps peek is a misnomer; perhaps typecheck or metadata would better indicate what it's used for? I agree that checking the type of the artifact is useful, but it may not be what a standard user expects when they want to "peek" at their data. Again, I'm not sure this is such a big deal that you would want to break compatibility, but it is a consideration.

I understand from a developer perspective why "you can always unzip the artifact and see what's inside", but with complete honesty, from a user perspective I'm going to be noticeably more reluctant to bother with artifacts if I can't even get a glimpse of what's inside them without an extra unzip step. It's not a real pain to unzip, but more of a psychological one. I agree that printing the manifest might be an alright first step, but I probably care much less about a directory listing of fastq filenames than I do about checking the header lines, sanity checking that the quality scores are well-formed, checking sequence length...
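For what it's worth, a glimpse like this can be had without unpacking the archive to disk. The sketch below streams the first FASTQ record out of a gzipped member inside a zip using only the Python standard library. The member path and the reads are made up for the demo, and a real artifact's internal layout may differ, so treat the names as assumptions.

```python
import gzip
import io
import itertools
import zipfile


def head_of_member(archive, member, n_lines=4):
    """Return the first n_lines of a gzipped text member, streamed from the zip."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as raw:
            # gzip.open accepts a file object, so nothing is written to disk.
            with gzip.open(raw, "rt") as text:
                return [line.rstrip("\n") for line in itertools.islice(text, n_lines)]


def build_demo_archive():
    """In-memory zip holding one tiny fastq.gz (made-up reads)."""
    fastq = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIHH\n"
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("1234-demo/data/s1_R1.fastq.gz", gzip.compress(fastq.encode("ascii")))
    buf.seek(0)
    return buf


record = head_of_member(build_demo_archive(), "1234-demo/data/s1_R1.fastq.gz")
header, sequence, separator, quality = record
print(header, sequence, quality)
# Basic sanity check: one quality character per base.
print(len(sequence) == len(quality))
```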

I agree that it should probably be the responsibility of the type itself (which also should ideally make it easier to split up the work) and would generalize over the CLI / API / Studio easiest that way.

Sorry that this conversation doesn't help much besides saying "just unzip and look". It may not be a bad idea to add a bit of discussion in the help somewhere explaining that's the current recommended way of dealing with artifacts. It would be great when types are extended, and I'll keep my eyes open for it, but I understand there are probably higher priorities. Thanks,

Mehrbod_Estaki · April 23, 2018, 9:11am

I think the idea of being able to get a true peek without exporting/extracting is very helpful, especially for troubleshooting. Perhaps a less complicated, and of course less ideal, way to get a true peek option would be to store a few lines (head or tail output?) of the sequences within the artifact to call on? Of course this doesn't make sense for all artifact types, but for joined reads, for example, it would be pretty helpful. I think storing the dimensions/wordcount might also be helpful; that bit could even go in provenance, maybe?

kastman · April 23, 2018, 8:32pm

Are you using the blocked zip (BGZF-ish)? I wonder if you could store the offsets of the first and last data blocks for quick head/tail? That might be tricky or self-referential, but may be worth considering (i.e. storing the whole artifact as a tar, but just BGZF'ing the data). It's probably overly complicated for the use case unless there are other reasons to do it this way.

Also allowing types to have their own summary data is a good idea too. Definitely dimensions, and I could see wordcount being useful too, though this is a slippery slope; i.e. why not histogram of all quality scores, sequence character count / GC %...?

thermokarst · April 23, 2018, 8:48pm

Thanks @Mehrbod_Estaki & @kastman!

for this type of mechanism!

peek isn't just a type check - it provides information about the UUID, the format, and the type; and the definition of metadata in the context of QIIME 2 is already pretty well established:

qiime tools peek --help
Usage: qiime tools peek [OPTIONS] PATH

  Display basic information about a QIIME 2 Artifact or Visualization,
  including its UUID and type.

I think the kind of insight you are trying to gain in this case is different than what we are defining as peek in QIIME 2 - maybe a more appropriate word would be stats or summary?

By the way, have you had a chance to check out the Artifact API? You can do things like this, which might help out with understanding the content of your data outside of QIIME 2, without "limiting" you to QIIME 2:

>>> from qiime2 import Artifact
>>> from qiime2.plugins import feature_table
>>> unrarefied_table = Artifact.load('table.qza')
>>> rarefy_result = feature_table.methods.rarefy(table=unrarefied_table, sampling_depth=100)
>>> rarefied_table = rarefy_result.rarefied_table
>>> import biom
>>> biom_table = rarefied_table.view(biom.Table)
>>> print(biom_table.head())
# Constructed from biom file
#OTU ID      L1S105  L1S140  L1S208  L1S257  L1S281
b32621bcd86cb99e846d8f6fee7c9ab8     25.0    31.0    27.0    29.0    23.0
99647b51f775c8ddde8ed36a7d60dbcd     0.0     0.0     0.0     0.0     0.0
d599ebe277afb0dfd4ad3c2176afc50e     0.0     0.0     0.0     0.0     0.0
51121722488d0c3da1388d1b117cd239     0.0     0.0     0.0     0.0     0.0
1016319c25196d73bdb3096d86a9df2f     11.0    17.0    12.0    4.0     2.0
>>> import pandas as pd
>>> df = rarefied_table.view(pd.DataFrame)
>>> df.head()
        b32621bcd86cb99e846d8f6fee7c9ab8  99647b51f775c8ddde8ed36a7d60dbcd  \
L1S105                              25.0                               0.0
L1S140                              31.0                               0.0
L1S208                              27.0                               0.0
L1S257                              29.0                               0.0
L1S281                              23.0                               0.0
...

Anyway, for now, the options are to use the type's appropriate summary/tabulate visualization, use the Artifact API directly, or export/extract and use a third-party tool. I totally agree though, adding a new low-level command for interrogating the contents of the data would be really important, and could help streamline things. It has been on our radar for a while now, but in practice it hasn't proven to be as big of a hole as I initially anticipated. Thanks!

ebolyen · April 26, 2018, 7:09pm

Zip files actually make this very easy to pull off, so I don't think we need to do anything too special necessarily.

I believe so, we're using the default which is DEFLATE (zlib). This is a block-based compression, so tail should be possible with some work.

In fact, the zip structure contains the offset to each file in its central directory (we're technically using a 64-bit extension of that), which is stored at the end. So you can look up an arbitrary file by doing some easy math with the offsets. This is why we use a zip file instead of a tar file. With a bit more effort, tail should be possible by decompressing the blocks and stitching them together again (the zip header also has a compressed and uncompressed data length, so EOCD - offset + compressed-file-length gets you that information).
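The central-directory bookkeeping described above is visible from Python's standard zipfile module, which exposes each entry's local header offset and its compressed and uncompressed sizes. The sketch below builds a small in-memory archive (member names and contents are made up) and lists those values. It only illustrates the random-access property and is not how QIIME 2 or q2view are implemented.

```python
import io
import zipfile

# Build a small DEFLATE-compressed archive in memory (made-up members).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("1234-demo/metadata.yaml", "uuid: 1234-demo\n")
    zf.writestr("1234-demo/data/seqs.txt", "ACGT\n" * 1000)

# The central directory (stored at the end of the file) records, per member,
# where its local header starts and how large it is before/after compression.
with zipfile.ZipFile(buf) as zf:
    entries = {
        info.filename: (info.header_offset, info.compress_size, info.file_size)
        for info in zf.infolist()
    }

for name, (offset, compressed, uncompressed) in entries.items():
    print(f"{name}: offset={offset} compressed={compressed} uncompressed={uncompressed}")
```

Because every member's position is known up front, a reader can seek straight to one file instead of scanning the archive from the beginning, which a plain tar stream would require.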

Historical note, early iterations actually were tar files! (we called them .qtf for qiime tar file) but we realized random access of the components would be really useful. q2view would be very hard to pull off without that property!