QIIME 2 Plugin query: variable number of inputs/outputs

Dear Prof. Dr. Bokulich,

I am writing to you as an author of QIIME 2 because of your posts here: [Help with making a plugin for core microbiome].

I am looking to write two methods in a plugin that I am developing. The first needs to read in a variable number of archives, which I could possibly transform to Metadata via a transformer, and it will output one archive. The other needs to read in one archive but needs to output a variable number of archives that simply cannot be fixed in advance. You mentioned at [Help with making a plugin for core microbiome] in February '19 that "Right now, the number of output files generated by QIIME 2 plugins must be fixed, though we are working on that (e.g., to allow optional output files)." Could you please advise whether this change to QIIME 2 is now available, and if so, how? I could not find documentation on it.

I would also be interested to know whether it is possible to take a variable number of input archives when they cannot be transformed into Metadata.

With many thanks for your time and help

Hi @mroper,

Welcome to the forum!

This is already possible: you would register the input as a List of type X. See the merge action in q2-feature-table for a real example of this registration.
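A minimal sketch of what such a registration could look like (the plugin object, function, and types below are illustrative placeholders, not the actual q2-feature-table code):

```python
import biom
from qiime2.plugin import Plugin, List
from q2_types.feature_table import FeatureTable, Frequency

plugin = Plugin(name='my-plugin', version='0.0.1')


def combine_tables(tables: biom.Table) -> biom.Table:
    # for a List input, the framework passes a Python list of views
    merged = tables[0]
    for table in tables[1:]:
        merged = merged.merge(table)
    return merged


plugin.methods.register_function(
    function=combine_tables,
    # List[X] accepts any number of artifacts, all of the same type X
    inputs={'tables': List[FeatureTable[Frequency]]},
    parameters={},
    outputs=[('combined_table', FeatureTable[Frequency])],
    name='Combine tables',
    description='Combine a variable number of feature tables into one.',
)
```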

That works too. In that case you would not input a List, only a single Metadata object. QIIME 2 Metadata objects have a merge method; see the documentation here:
https://dev.qiime2.org/latest/metadata/#merging-metadata
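For example, a minimal sketch (the IDs and columns are made up):

```python
import pandas as pd
from qiime2 import Metadata

md1 = Metadata(pd.DataFrame(
    {'body-site': ['gut', 'tongue']},
    index=pd.Index(['sample-1', 'sample-2'], name='id')))
md2 = Metadata(pd.DataFrame(
    {'subject': ['A', 'B']},
    index=pd.Index(['sample-1', 'sample-2'], name='id')))

# merge() joins on shared IDs and returns a new Metadata object
merged = md1.merge(md2)
```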

Some interfaces, e.g., the CLI, automatically merge metadata: the user simply passes multiple metadata files. For other interfaces (e.g., the Python API), users would merge their Metadata objects (see the API docs above) before passing them to your action.

That is currently not possible to my knowledge. Let me see if others can comment on this.

Yep! See the first example above in q2-feature-table. I think any semantic type can be passed as a List, as long as all the artifacts are the same type.

Thank you!

It would be great if I could output a variable-length list of archives. I look forward to hearing on the forum whether this is possible.

Hi @mroper,

Can you provide us with more detail on your use case for this? What will your input be, and what action will create this variable number of artifacts? How will the number of artifacts be determined within your method?

We do have some work on the horizon that may address this, but I'd like to hear more about your use case to determine whether it will work for what you are trying to accomplish. Thanks!

Hi @lizgehret,

I have a very long-running action, and how long it runs depends on the size of the input archive. So I want to create an action that splits the input archive into parts (which will themselves just be archives), with the number of parts determined by a numerical parameter. Then I can run the original action on the resulting set of archives in parallel, on a cluster or on a single machine, in a way that lets me create backups along the way.

I don't want to have to modify the original action to achieve this goal.

It would be great if you could do this!

Thanks

Hi @mroper,

Hopefully it's okay that I offer a potential solution. This is based on what worked in my plugin and very much represents my view.

I can think of two approaches that may or may not work for you (I've had a similar problem, although I built the solution into my own functionality). I've been working with dask as a Python library to handle parallelization. It includes ways to write/save individual text files. You may have to design a dask-specific directory or data format to hold the output (for example, if you write multiple individual dask files). If you have a Python function, it can be wrapped with pretty simple function decorators. I'm less certain about a command-line application, but it seems like you could again build a directory format that saves partial files rather than complete ones.
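A minimal sketch of the decorator approach with dask (the per-chunk work here is just a stand-in):

```python
import dask


@dask.delayed
def process_chunk(chunk):
    # stand-in for the long-running per-chunk computation
    return sum(chunk)


# in practice these chunks would come from splitting the input archive
chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# each call builds a lazy task; compute() runs them in parallel
tasks = [process_chunk(c) for c in chunks]
results = dask.compute(*tasks)
print(results)  # (6, 15, 24)
```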

I'm not sure how you'd handle hanging or unfinished jobs within this framework, though. If my job gets 95% of the way through and then fails on one sample, losing all that work would be very frustrating. It would be nice to have an intermediate object that waited. I can imagine some inelegant solutions to this, like allowing the user to pre-chunk their data via a manifest and then merge on the other end.

Best,
Justine.

Hey @mroper,

Thanks for providing that context! We will actually be working on an update to the framework that will essentially handle the exact scenario you've described. Here's a current description of this proposed workflow:

For your needs, you may end up needing to turn your method into a pipeline within the framework in order to use this functionality - but again, this is still in the works, so we will have more exact details once it is fully built out (hopefully sometime early next year).
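For a rough idea of the pipeline side of this, here is a minimal sketch (the plugin and action names are placeholders, and the parallelization machinery of the upcoming update is not shown):

```python
def split_and_run(ctx, sequences, num_parts):
    # a pipeline receives `ctx`, which can look up other registered actions
    run_part = ctx.get_action('my_plugin', 'run_part')
    merge_parts = ctx.get_action('my_plugin', 'merge_parts')

    partial_results = []
    for i in range(num_parts):
        # hypothetical action that processes only the i-th slice of the input
        out, = run_part(sequences=sequences, part=i, num_parts=num_parts)
        partial_results.append(out)

    merged, = merge_parts(parts=partial_results)
    return merged
```

Such a function would be registered with `plugin.pipelines.register_function` rather than `plugin.methods.register_function`.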

In the meantime, the suggestions from @jwdebelius may be some good options to investigate!

Cheers :lizard:

Hi @lizgehret and @jwdebelius,

Thanks very much for your responses. I have implemented the workaround of modifying the long-running action to take a subset of the input.
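In case it helps anyone reading later, that kind of change amounts to something like the following sketch (the names and slicing scheme are illustrative, not my exact code):

```python
import biom


def long_running(table: biom.Table, part: int = 0,
                 num_parts: int = 1) -> biom.Table:
    # keep every num_parts-th sample, offset by `part`, so that running
    # the action with part = 0 .. num_parts - 1 covers the whole input
    sample_ids = table.ids(axis='sample')
    keep = set(sample_ids[part::num_parts])
    subset = table.filter(keep, axis='sample', inplace=False)
    # ... the original long-running work then operates on `subset` ...
    return subset
```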

With many thanks to you both!
