Looking for Comprehensive Qiime2 Documentation

Hey all,

I am fairly new to Qiime2 and feel like I am struggling a bit to transition to the Qiime way of doing things. Hopefully this doesn't come across as overly whiny, but relative to other bioinformatics tools, Qiime feels unusually black-box-y and unintuitive to me. Among other things, I think it's because 1) all files use the same two extensions and are zipped by default, 2) you frequently need to open files in a browser to get a sense of their contents, 3) files use somewhat cryptic semantic types, and 4) it feels tricky to figure out how to go from A to B in an analysis once you're ready to go beyond what's covered in the Moving Pictures tutorial. Note, I completely understand that there are good and valid reasons for the above, I'm just describing the net effect to a (or at least this) new user.

Every couple of days, I'm left with browser windows that look like this, as I try to piece together information from across the forum:
[Screenshot: Image on 2025-06-04 02.29.06 PM]

For context, I'm aware of the following resources and have read, skimmed, or watched all of them:

In those resources, I can't seem to find the following kinds of information, which would be really helpful to me.

  1. For any given artifact class (e.g. FeatureTable[Frequency]), illustrations of what the data usually looks like. I'm envisioning something like the Data Schema/Data format descriptions you can get in the UCSC Table Browser, e.g.:

    Perhaps you could also include links to a few minimal examples that could be viewed in view.qiime2.org, like in the Moving Pictures tutorial.
  2. For any given artifact class (e.g. FeatureTable[Frequency]), the set of qiime commands that can manipulate that class.
  3. Basically, the inverse of 2. For any given qiime (or qiime plugin) command, the list of available classes it can process (i.e. its available inputs and outputs) with links to example data (e.g. developed as part of 1.) illustrating what the data looks like before and after having been processed by said command. Ideally, there would also be links to the relevant paper describing the method behind the command and/or links to high-quality summaries of what the command is doing on the forum.
  4. Some sort of (possibly AI-powered) graph/network-like utility that can map out how to get from A to B in an analysis. Or put another way, a sort of interactive flowchart for the entire Qiime action/object universe, constrained by the data manipulations that are possible in Qiime2. I'm thinking of something along the lines of a neo4j graph database. Ideally, this utility could help answer questions like: Given a FeatureTable[Frequency], what is the shortest path (and corresponding commands) for obtaining a FeatureData[Taxonomy].
    For a more concrete example (which in part prompted me to post this), I processed a dataset through to ANCOM-BC following the Moving Pictures tutorial, and produced some differential abundance outputs like this.

    I also classified the features with GreenGenes2. However, I can't seem to map the features in the differential abundance plot to their relevant taxa (as the Feature IDs don't match), and I have no idea which transformation I should do (or shouldn't have done) to get from A to B here.

    If I had a tool that showed me where I was in the Qiime2 pipeline universe and I could look at the inputs and outputs of nearby "nodes", I could potentially figure out what commands to run, but as it is, my main option is to bug you fine folks for help.

Aside from the above, I haven't been able to find information about the following:

  1. I know how to view the provenance of a file, but is there a way to output the actual Qiime2 CLI commands that were run to generate it?
  2. Is there any way to have Qiime2 automatically add some sort of default suffix or prefix to the files it outputs for any given command? That would save me the trouble of having to name all my files manually, and it would allow one to see at a glance how the file was processed. Perhaps when the filename got long enough, the earliest portions could start being reversibly hashed or something.
  3. I do most of my work on a remote server, and it's kind of a nuisance to have to transfer Qiime files locally to view them in a browser. As it is, zipped files (and tons of datatypes) can be viewed with Visidata, but I was wondering if you have any other suggestions for dealing with this and/or whether there is any work in progress to make more data viewable on the command line?

Ok, I think that's it. Looking forward to your feedback!


Hi @charlesalexandreroy, Thanks for your input and questions. Some of the things you're asking for exist, and many are 100% possible but limited by developer bandwidth. Some are in progress through a new interface that we're working on called Adagio, and we'd be happy to put you on a list of test users for that interface when we're ready for more test users (disclosure: that interface is being built by a for-profit company that I co-own).

I know how to view the provenance of a file, but is there a way to output the actual Qiime2 CLI commands that were run to generate it?

This is called provenance replay, and it is installed by default in all modern QIIME 2 distributions. Call qiime tools replay-provenance --help for more detail. Also see Keefe et al. (2023).

Is there any way to have Qiime2 automatically add some sort of default suffix or prefix to the files it outputs for any given command?

The closest that we have to this would be if you use the --output-dir option when calling QIIME 2 commands from the command line. This will use the output names to name the files in the directory that is created to store the output. I'll use this as follows (for example):

qiime dada2 denoise-paired --output-dir dada2-run1 ...

Then, within output directories, all outputs from related commands will have the exact same file names and I can embed some description of what is in there in the output directory name. (Remember, that's just a helpful description - the definitive source of what was done is always the data provenance.)

I do most of my work on a remote server, and it's kind of a nuisance to have to transfer Qiime files locally to view them in a browser.

If you configure X forwarding (or similar - depends on your remote server configuration) and there is a web browser on that server, using qiime tools view might work. We don't have another option at this point, though we are interested in making qiime tools peek provide some artifact-class-specific information, like the first 50 sequences in a FeatureData[Sequence] artifact. Part of the issue is that real .qza files (i.e., not from test data sets) tend to contain a lot of information, so it's hard to present anything meaningful that is human readable. For that reason, it hasn't made it to the top of the priority list. qiime tools export, followed by viewing as you normally would for a large fasta/newick/etc. file, is also a straightforward option, and you could probably create a shell script or alias that would simplify this workflow. Another part of this is simply limited developer bandwidth: I love the idea of more useful and accessible summaries of artifacts, but we'd probably need some dedicated funding for developer time, or a community developer to step in, to actually make it happen. If either of those is something you could contribute, let's chat!

For any given artifact class (e.g. FeatureTable[Frequency]), illustrations of what the data usually looks like.

This is definitely planned for the documentation - not exactly sure where yet, but probably Using QIIME 2. We've started work on this (e.g., call qiime tools list-types and you'll see that some of the artifact classes have text descriptions) - adding more descriptions, and linking examples, is something that's on our radar (again, contributions are more than welcome, and adding more of these is straightforward - very minimal coding - let me know if you're interested!).

For any given artifact class (e.g. FeatureTable[Frequency]), the set of qiime commands that can manipulate that class. ... For any given qiime (or qiime plugin) command, the list of available classes it can process (i.e. its available inputs and outputs) with links to example data (e.g. developed as part of 1.) illustrating what the data looks like before and after having been processed by said command. ... Some sort of (possibly AI-powered) graph/network-like utility that can map out how to get from A to B in an analysis.

This is totally possible - in fact the whole system was designed with these types of queries in mind. You can find a step toward this in QIIME 2 on Galaxy (which you can try through our new Galaxy Docker container - see here) - notice for example that when you select a command, only data of the relevant type(s) are shown as possible inputs. We're building toward more functionality like what you're describing in Adagio, and this type of thing can be coded up using the QIIME 2 software development kit (SDK). More content on this is planned for Developing with QIIME 2 under interface development. So again, totally possible, and planned, but limited by developer bandwidth.

Thanks for bringing this up - the fact that this is possible is a strength of QIIME 2 relative to many other bioinformatics tools, and we're doing all we can to make the most of this functionality so it can facilitate discoverability by users.
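
To make the "how do I get from A to B" query concrete, here is a minimal Python sketch of one way it could work. It does not use the QIIME 2 SDK: the registry is a hand-written toy, the action names (denoise, classify, collapse) and the types RawReads, Classifier, and CollapsedTable are hypothetical stand-ins, and only the FeatureTable/FeatureData labels come from this thread. The search is a simple greedy forward chaining over action inputs and outputs, so it finds a workable sequence of steps but is not guaranteed to find the shortest one.

```python
from collections import namedtuple

# Toy registry: each action needs ALL of its inputs and yields ALL of its outputs.
# Names and signatures are illustrative assumptions, not real plugin signatures.
Action = namedtuple("Action", ["name", "inputs", "outputs"])

TOY_ACTIONS = [
    Action("denoise", {"RawReads"},
           {"FeatureTable[Frequency]", "FeatureData[Sequence]"}),
    Action("classify", {"FeatureData[Sequence]", "Classifier"},
           {"FeatureData[Taxonomy]"}),
    Action("collapse", {"FeatureTable[Frequency]", "FeatureData[Taxonomy]"},
           {"CollapsedTable"}),
]


def plan(available, target, actions):
    """Return action names that lead to `target`, or None if unreachable."""
    have = set(available)
    steps = []
    while target not in have:
        # An action is runnable if its inputs are on hand and it adds something new.
        runnable = [
            a for a in actions
            if a.inputs <= have and not a.outputs <= have
        ]
        if not runnable:
            return None
        action = runnable[0]
        have |= action.outputs
        steps.append(action.name)
    return steps


if __name__ == "__main__":
    print(plan({"RawReads", "Classifier"}, "FeatureData[Taxonomy]", TOY_ACTIONS))
    print(plan({"FeatureTable[Frequency]"}, "FeatureData[Taxonomy]", TOY_ACTIONS))
```

In this toy registry, starting from only a FeatureTable[Frequency] yields no plan at all, because the taxonomy step needs sequences and a classifier. Surfacing that kind of missing input is the sort of answer a real type-aware interface could give.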

Ideally, there would also be links to the relevant paper describing the method behind the command

This information is available by running (for example):

qiime dada2 denoise-paired --citations

I can't seem to map the features in the differential abundance plot to their relevant taxa

For this specific question, I'd switch to ANCOM-BC2 and take a look at this section in the gut-to-soil (g2s) tutorial.

Thanks again for your detailed message, and your willingness to explore all the different sources of documentation! We're working on expansion of the documentation and functionality, and also improving discoverability of all of the various resources. Lots more coming soon, and we hope to be continually improving the user experience and the cutting edge functionality that you can expect from the platform.

One final note: a lot of the points you're bringing up here are things that the QIIME 2 Framework (Q2F) offers, and are not specific to the amplicon distribution, etc. So, they're available (or will be) for all distributions and stand-alone plugins (e.g., MOSHPIT, q2-kmerizer, ...). I'm guessing this is clear to you since you've been through Using QIIME 2, but just mentioning for others who come across this discussion.

:qiime2: forum moderators and other users - anything I'm forgetting to highlight here?

:fire: :fire:


Hi Greg,

Thanks so much for your time in preparing this detailed response; I really appreciate it! And I totally understand the limitations imposed by developer bandwidth. While I am passionate about creating high-quality documentation, for the time being at least, I don't have any spare bandwidth to volunteer on the Qiime documentation myself. Happy to be added as an eventual test user of Adagio, though!

When you have a chance, I have a few more (somewhat pedantic) terminology and documentation-related questions for you.

I read through this page, which explains the difference between file types, data types, semantic types, and artifact classes. As stated on that page, "[t]here is a many-to-many relationship between file types, data types, and semantic types... [and i]t’s possible that a given semantic type could be represented on disk by different file types." Given that, it would seem like artifact classes (which appear to be a more recent addition) were created to bridge the gap between semantic types and file types so that plugin software knows exactly how to process the files in an artifact of a given class. Does that partially explain the origin story of artifact classes? :man_supervillain::woman_superhero: If my understanding is correct, there would seemingly be a one-to-one relationship between file types and artifact classes and a one-to-one relationship between artifact classes and semantic types, is that right?

And then, regarding semantic type and artifact class naming conventions, do they follow these REGEX patterns (while also seemingly using camel case (which I don't know how to represent in REGEX :sweat_smile:))?:

  • Semantic Types: ^[A-Za-z0-9]+(?:\[(?:[A-Za-z0-9]+(?:, [A-Za-z0-9]+)*)\])?$
  • Artifact Classes: ^[A-Za-z0-9]+$
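
A quick way to sanity-check these two patterns is to run them against a few labels. The sketch below uses Python's re module with the patterns exactly as written above. The first three labels appear in this thread; the last two are made-up labels included only to exercise the comma-separated and nested cases, and check is a hypothetical helper name.

```python
import re

# Patterns copied verbatim from the two bullets above.
SEMANTIC_TYPE = re.compile(
    r"^[A-Za-z0-9]+(?:\[(?:[A-Za-z0-9]+(?:, [A-Za-z0-9]+)*)\])?$"
)
ARTIFACT_CLASS = re.compile(r"^[A-Za-z0-9]+$")

LABELS = [
    "FeatureTable[Frequency]",   # from this thread
    "FeatureData[Taxonomy]",     # from this thread
    "SingleDNASequence",         # from this thread
    "Outer[First, Second]",      # made-up: two comma-separated fields
    "Outer[Inner[Leaf]]",        # made-up: nested brackets
]


def check(label):
    """Return (matches semantic-type pattern, matches artifact-class pattern)."""
    return (
        bool(SEMANTIC_TYPE.match(label)),
        bool(ARTIFACT_CLASS.match(label)),
    )


if __name__ == "__main__":
    for label in LABELS:
        as_type, as_class = check(label)
        print(f"{label:26} semantic type: {as_type!s:5}  artifact class: {as_class}")
```

Under these patterns, the artifact-class pattern rejects FeatureTable[Frequency], which this thread uses as an example artifact class, and neither pattern accepts a nested label.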

If so, I just wanted to ask about a few places where the documentation might be inconsistent.

  1. When you run qiime tools list-formats..., it appears to list the importable/exportable artifact classes. Perhaps the command should be qiime tools list-classes...?
  2. When you run qiime tools peek <artifact>, it appears to label the semantic type as "Type" and the artifact class as "Data format", which could be clearer.
  3. In the reference section of the amplicon docs, there's a page called "Artifact Classes". However, the entries on that page appear to be organized by plugin and then by semantic type. Each semantic type has a "format" which appears to map to an artifact class. Similarly, the Formats page appears to be organized by plugin and then by artifact class. If my understanding is right, those pages could be named more consistently.

The above items make me wonder whether "format" is a synonym of "artifact class", which would resolve some of those discrepancies. However, further down the page I referenced at the top, there is this paragraph (emphasis mine), which seems to distinguish between artifact classes and formats:

Most of the time, plugin developers are more concerned with artifact classes, rather than semantic types and formats directly, though we only recently starting transitioning the language used in our documentation to reflect this. You may still see some outdated usage, such as treating the terms semantic type and artifact class as synonyms, especially in older video content. Sorry for the confusion!

If "formats" are distinct from artifact classes in some way, how do they relate to the concepts of file types, data types, semantic types, and artifact classes?

Apologies if I've confused myself and/or if this has been clearly documented somewhere that I missed! Thanks again for all the work you and your team have put into creating and documenting these open-source production-quality tools!

Best,
Charles

Hi @charlesalexandreroy,
Thanks for your follow-up question, and for your interest in learning about all of this. I'm going to try to answer your questions here. Let me know if you still have questions (just FYI my replies might be slow but happy to help you understand).

The term Artifact Class is relatively new, but the structure has been in place since we started work on the QIIME 2 Framework (Q2F). This lesson, from the Developing with QIIME 2 plugin tutorial, might be a good reference for you too as it shows how an Artifact Class is created. I'm going to use the types/etc from that lesson in my examples here.

Briefly, a format (also called a file type) defines how data is stored on disk (e.g., SingleRecordDNAFASTAFormat, or a fasta file that has exactly one record). A semantic type defines the meaning of that data (e.g., SingleDNASequence, or a single DNA sequence). An artifact class is what a Q2F action can take as an input, or what a Q2F method can produce as an output (e.g., SingleDNASequence). The labels for artifact classes look identical to those for semantic types, and previously we conflated these ideas,[1] but there is a subtle difference. A semantic type on its own is just a concept: it's just the idea (e.g.) of a single DNA sequence. It's not until it's linked to a default on-disk representation that Q2F actually knows how to read or write one of these - at that point, we have an Artifact Class. Lines 41-58 here show this in action, and the corresponding text that I linked to above goes through what is happening there.

So, back to the relationships between these terms.

  • A semantic type can be associated with one or more file formats. (For example, a sequence can be represented in a fasta file or a fastq file.)
  • A file format can be associated with one or more semantic types. (For example, a fasta file can store a DNA sequence or an RNA sequence.)
  • A semantic type can be represented by zero or one artifact classes. A semantic type that is represented by zero artifact classes isn't practically useful, but it might exist for example if a plugin developer is just mapping out ideas. Because a semantic type can be represented by at most one artifact class, using the same label for both makes sense.
  • An artifact class can have exactly one semantic type.
  • An artifact class can have exactly one format associated with it.[2]
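
The relationships in the list above can be sketched as a small Python data model. This is a toy for illustration, not Q2F's actual registration API: ToyRegistry, ArtifactClass, and their methods are hypothetical names, and SingleRNASequence, SingleRecordRNAFASTAFormat, and DraftIdea are made-up entries. SingleDNASequence and SingleRecordDNAFASTAFormat are the examples used earlier in this reply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactClass:
    semantic_type: str  # exactly one semantic type
    fmt: str            # exactly one registered (default) format


class ToyRegistry:
    """Toy model of the type/format/artifact-class relationships."""

    def __init__(self):
        self.semantic_types = set()
        self._artifact_classes = {}  # semantic type -> ArtifactClass

    def define_semantic_type(self, name):
        # A semantic type alone is just a concept: no format attached yet.
        self.semantic_types.add(name)

    def register_artifact_class(self, semantic_type, fmt):
        if semantic_type not in self.semantic_types:
            raise KeyError(f"unknown semantic type: {semantic_type}")
        if semantic_type in self._artifact_classes:
            # A semantic type is represented by at most one artifact class.
            raise ValueError(f"{semantic_type} already has an artifact class")
        artifact_class = ArtifactClass(semantic_type, fmt)
        self._artifact_classes[semantic_type] = artifact_class
        return artifact_class

    def artifact_class_for(self, semantic_type):
        """Return the artifact class for a semantic type, or None."""
        return self._artifact_classes.get(semantic_type)


registry = ToyRegistry()
for name in ("SingleDNASequence", "SingleRNASequence", "DraftIdea"):
    registry.define_semantic_type(name)

registry.register_artifact_class("SingleDNASequence", "SingleRecordDNAFASTAFormat")
registry.register_artifact_class("SingleRNASequence", "SingleRecordRNAFASTAFormat")
# "DraftIdea" is left with zero artifact classes: defined, but not usable yet.
```

Nothing in this model stops two artifact classes from naming the same format, which matches the point that a file format can be associated with more than one semantic type. Registering a second artifact class for SingleDNASequence raises a ValueError, reflecting the "zero or one" rule.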

Some of the terminology issues that you noted are almost certainly real problems with how we're describing things, lingering from our adoption of the new terminology. I'll review the examples you enumerated and create issues as needed.

Hope that helps clear things up. Gotta run for now!


  1. Around when we started working on DWQ2 we introduced this term so we had a way to talk about this distinction. ↩︎

  2. A caveat here is that other formats can be associated with the semantic type, and it's possible to transform a format to another. This could be useful, for example, if you want to export data from an artifact class in a different format than is registered to it - the FeatureTable[Frequency] artifact class is a good example of this. (This also allows us to update the format associated with an artifact class while maintaining backward compatibility with old artifacts, for example if a fancy new format becomes available for representing data that is relevant to the ecosystem.) ↩︎