Multiple input files

Stefan · November 27, 2017, 10:31pm

My strategy to compute a task in parallel is to chunk the multi-FASTA input file into several pieces of, say, 100 sequences each. After all chunks have been completed, I need to merge the results. Therefore, I wonder how I can specify multiple input files for a QIIME 2 plugin.
I am looking for something like what the metadata plugin does, but from looking at the code I could not figure out the right way to use it.

I saw from the changelog that multiple arguments for the same parameter are prohibited.
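
For readers following along, the chunk-then-merge strategy described in this post could look roughly like the sketch below. It is plain Python and does not use any QIIME 2 API. The function names (`read_fasta`, `chunk_records`, `merge_results`) are hypothetical, and the merge step assumes each chunk yields a dict keyed by sequence ID, which the post does not specify.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without the leading '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from the lines of a multi-FASTA file."""
    header, seq_parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq_parts)
            header, seq_parts = line[1:], []
        else:
            seq_parts.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(seq_parts)


def chunk_records(records: Iterable[Record], size: int = 100) -> Iterator[List[Record]]:
    """Group records into lists of at most `size` records each."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def merge_results(per_chunk_results: Iterable[Dict[str, object]]) -> Dict[str, object]:
    """Combine per-chunk result dicts (assumed to be keyed by sequence ID)."""
    merged: Dict[str, object] = {}
    for result in per_chunk_results:
        merged.update(result)
    return merged
```

Each chunk could then be processed independently, with `merge_results` applied once all chunks have finished.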

thermokarst · November 27, 2017, 10:55pm

Hi @Stefan! Can you provide some clarification here - are you looking to specify multiple user inputs? Or are these multiple inputs some kind of intermediate data, produced by some tool as part of the internal process of your Method?

Stefan · November 28, 2017, 4:23pm

I need to operate on multiple user inputs, i.e. artifacts. They are all of the same type.

thermokarst · November 28, 2017, 5:46pm

Okay! How is the user going to be chunking these data? Are you envisioning them supplying n pre-made chunks to this Method, which will then merge the data? That seems a little cumbersome, versus the user providing their data all at once and then the chunking happening internally to the qiime2 method. Do you have some example code floating around on a branch somewhere?

Stefan · November 28, 2017, 6:05pm

What is already happening is that users independently run fragment insertion for several microbiome studies. If they then decide to do a meta-analysis it is not necessary to run fragment insertion on a merged biom table. Instead, to save some compute, the user could collect all his/her placement files and merge them into one, from which one unified insertion tree can be computed - as long as all individual runs used the same reference tree.

Another use case is to chunk your input to distribute it across a grid, but that is more of an edge case. (Since it is cluster-specific and trivial, I don't want to add explicit support in the plugin for chunking the input and executing the chunks.)

In the end, I don't have control over the number or names of the placement files a user wants to merge.
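
As a rough illustration of the merge described above, here is a minimal sketch. It assumes the placement files are JSON documents with `tree`, `fields`, and `placements` keys (as in the jplace format); the thread itself does not state the file format, so treat that layout and the function names as assumptions. The merge is refused unless every run used the same reference tree.

```python
import json
from typing import Dict, Iterable, List


def merge_placements(documents: Iterable[Dict]) -> Dict:
    """Merge placement documents that share one reference tree.

    Assumes each document has 'tree', 'fields' and 'placements' keys.
    """
    docs: List[Dict] = list(documents)
    if not docs:
        raise ValueError("need at least one placement document")
    first = docs[0]
    merged = {"tree": first["tree"], "fields": first["fields"], "placements": []}
    for index, doc in enumerate(docs):
        # Placements are only comparable if they refer to the same tree.
        if doc["tree"] != first["tree"]:
            raise ValueError(f"document {index} uses a different reference tree")
        if doc["fields"] != first["fields"]:
            raise ValueError(f"document {index} uses different placement fields")
        merged["placements"].extend(doc["placements"])
    return merged


def merge_placement_files(paths: Iterable[str], out_path: str) -> None:
    """Load any number of placement files, merge them, and write the result."""
    docs = []
    for path in paths:
        with open(path) as handle:
            docs.append(json.load(handle))
    with open(out_path, "w") as handle:
        json.dump(merge_placements(docs), handle)
```

The sketch accepts any number of files under any names. It does not deduplicate sequences that were placed in more than one run, and it keeps only the three keys named above.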

thermokarst · November 28, 2017, 6:21pm

Would users be able to use the existing merging methods provided in feature-table: merge-seq-data and merge? These methods allow for merging of sequences and feature tables, respectively. We are currently in the process of implementing variadic inputs, so once that lands, these merge methods would accept any number of inputs. If I am off base here please let me know!

HPC environments are on our radar for the future (maybe 2018?) - we would like to expose a handful of elements within the framework itself to make this pretty straightforward for plugin devs, but it is still a bit nebulous. Stay tuned!

Stefan · November 28, 2017, 7:25pm

The data I need to merge are neither feature-tables nor sample metadata, but feature specific information. Thus, I cannot use functionality from feature-table.

Some background:
A "placement" is the internal node of a reference phylogeny (like Greengenes) for a specific nucleotide sequence, like those from DADA2 or Deblur. Those placements are thus independent of the experiment. Therefore, if the same sequence occurs in several experiments, we don't need to recompute the placement several times, because it will always be the same.
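
Because a placement depends only on the sequence and the reference phylogeny, the reuse described above amounts to a cache keyed by sequence. A minimal sketch follows; `compute_placement` is a hypothetical stand-in for the actual (expensive) placement step and is not a real QIIME 2 or plugin function.

```python
from typing import Callable, Dict, Iterable


def place_with_cache(
    sequences: Iterable[str],
    cache: Dict[str, object],
    compute_placement: Callable[[str], object],
) -> Dict[str, object]:
    """Return placements for `sequences`, computing only those not yet cached.

    `cache` maps a nucleotide sequence to its placement and is updated in
    place, so it can be shared across experiments that use the same
    reference phylogeny.
    """
    seqs = list(sequences)
    for seq in seqs:
        if seq not in cache:
            cache[seq] = compute_placement(seq)
    return {seq: cache[seq] for seq in seqs}
```

A sequence seen in a second experiment is then looked up rather than placed again, provided both experiments share the same reference tree.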

thermokarst · November 28, 2017, 7:29pm

I just want to make sure you saw that the two methods I posted above are for merging feature tables and sequences, not for merging sample metadata.

Can these data be expressed as a taxonomy? Users could use the merge-taxa-data method if that was the case.

Stefan · November 28, 2017, 7:42pm

No, those data cannot be expressed as taxonomy lineage strings, because it is not certain how many fields each placement has.

ebolyen · December 1, 2017, 10:42pm

Hi @Stefan,

It kind of sounds to me like you have a new FeatureData[Placement] type (or something like that). You should be able to recycle the strategy that the other merge methods use (there aren't any privileged plugins in QIIME 2, so anything that one plugin does, any other plugin can also do).

It does sound like you might benefit from waiting for variadic inputs, however, which will be supported soon. Here's a recent forum post on that.