Best approach to ID Mapping

(Zack H) #1

I am trying to map InChI strings to new metabolite identifiers for a plugin I am developing.

I have two TSV files of the following formats:

  1. A data file where column 1 contains InChI Strings, column 2 contains numerical values.
  2. A mapping file where column 1 contains InChI Strings, column 2 contains new identifiers.

The goal is to end up with an artifact containing a table where column 1 contains the new IDs and column 2 contains the corresponding numerical values.
(This artifact will be the input to a visualizer.)
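
The mapping step itself can be sketched in plain Python, independent of how it ends up being packaged in QIIME 2. This is only a minimal sketch: it assumes neither file has a header row and that each InChI maps to exactly one new ID. The function name `map_ids` and the sample rows are hypothetical.

```python
import csv
import io


def map_ids(data_tsv, mapping_tsv):
    """Return (new_id, value) rows by joining the two TSV inputs on InChI.

    Hypothetical helper; assumes no header rows and a 1-to-1 mapping.
    """
    # Build an InChI -> new ID lookup from the mapping file.
    lookup = {inchi: new_id
              for inchi, new_id in csv.reader(mapping_tsv, delimiter="\t")}
    rows = []
    for inchi, value in csv.reader(data_tsv, delimiter="\t"):
        # Skip InChI strings that have no entry in the mapping file.
        if inchi in lookup:
            rows.append((lookup[inchi], float(value)))
    return rows


# Made-up sample rows, for illustration only.
data = io.StringIO("InChI=1S/H2O/h1H2\t3.5\nInChI=1S/CH4/h1H4\t1.25\n")
mapping = io.StringIO("InChI=1S/H2O/h1H2\tM001\nInChI=1S/CH4/h1H4\tM002\n")
print(map_ids(data, mapping))
```

Silently skipping unmapped InChI strings is a design choice made for this sketch; raising an error or reporting them may be preferable in a real plugin.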

My original idea was to define two SemanticTypes and a transformer to produce the desired result as a third SemanticType.
(e.g. InChIDataValues + InChIMapping --> transformer --> NewID)

However, it seems that QIIME 2 Metadata can also be used for mapping purposes, and I have yet to see an example of a transformer that turns two inputs into one output.

What would be the ideal way to develop this mapping pipeline?

(Evan Bolyen) #3

Hi @zhaiman!

Sorry for the delayed response!

I was just talking to @mwang87 earlier this week about a similar situation in his plugin.

I guess I have a question to figure out what the best way forward might be:

In the mapping TSV (file 2), are the new IDs something a user provides or are they calculated by your plugin?

If they are user-defined, then I think Metadata is the way to go. This process sounds very similar to feature-table group, which can be used to relabel IDs (if your groups happen to be a 1-to-1 map).

Otherwise, another avenue would be a Directory format which contains both of these files. They would be accessible as a single input for a transformer, because as you correctly observed, transformers are limited to single input and single output. Although I’m not sure how the transformer itself would fit into your overall pipeline here.

(Zack H) #5

Hi @ebolyen,

Thank you for the response!

At this point in time, the user will not need to provide any IDs. Most mappings are 1-to-1, but occasionally there may be one InChI mapping to multiple IDs.
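
For the occasional one-to-many case, one option (an assumption, not something settled in this thread) would be to emit one output row per new ID, each carrying the original value. A minimal sketch, with a hypothetical function name and made-up rows:

```python
from collections import defaultdict


def expand_mapping(data_rows, mapping_rows):
    """Map (inchi, value) rows to (new_id, value) rows.

    An InChI with several new IDs yields one output row per ID;
    an InChI with no mapping yields no rows.
    """
    # Collect every new ID listed for each InChI.
    ids_by_inchi = defaultdict(list)
    for inchi, new_id in mapping_rows:
        ids_by_inchi[inchi].append(new_id)
    return [(new_id, value)
            for inchi, value in data_rows
            for new_id in ids_by_inchi.get(inchi, [])]


# Made-up rows, for illustration only.
data_rows = [("InChI=1S/H2O/h1H2", 3.5), ("InChI=1S/CH4/h1H4", 1.25)]
mapping_rows = [("InChI=1S/H2O/h1H2", "M001"),
                ("InChI=1S/H2O/h1H2", "M001b"),
                ("InChI=1S/CH4/h1H4", "M002")]
print(expand_mapping(data_rows, mapping_rows))
```

Duplicating the value across IDs may double count it in a downstream visualizer, so keeping only one ID per InChI could be the better fit depending on what the visualization expects.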

If I were to use a Directory format that contains both TSV files, would the following approach work?

class UnmappedDirectoryFormat(model.DirectoryFormat):
    mapping = model.File(r'mapping_file.tsv', format=MapFormat)
    unmapped_data = model.File(r'unmapped_data.tsv', format=UserDataFormat)

MappedDirectoryFormat = model.SingleFileDirectoryFormat(
    'MappedDirectoryFormat', 'mapped_data.tsv', format=MappedDataFormat)

Which I would pass to a transformer:

def _1(data: UnmappedDirectoryFormat) -> MappedDirectoryFormat:
    '''Read the unmapped data, then perform the mapping operations to get the new table mapped_data.'''
    # ... mapping step that produces mapped_data goes here ...
    ff = MappedDirectoryFormat()
    with ff.open() as fh:
        mapped_data.write(fh, format=ReadyToUseData)
    return ff

(Evan Bolyen) #7

Hi @zhaiman,

So sorry for the delay!

Yes, what you have there is the essence of the idea! Let me know if you need any assistance (happy to schedule a video call offline), but I think you have the hang of it.

You can also directly access those properties on the UnmappedDirectoryFormat via data.mapping and data.unmapped_data, which have .view and .format methods/properties. (.view will invoke transformers just like .write does, so you can subdivide your transformers if that is useful, though it's probably too much work for just two TSV files.)
