Best approach to ID Mapping

zhaiman · October 15, 2018, 8:33pm

I am trying to map InChI strings to new metabolite identifiers for a plugin I am developing.

I have two TSV files of the following formats:

A data file where column 1 contains InChI Strings, column 2 contains numerical values.
A mapping file where column 1 contains InChI Strings, column 2 contains new identifiers.

The goal is to end up with an artifact containing a table where column 1 contains the new IDs and column 2 contains the corresponding numerical values.
(This artifact will be the input to a visualizer.)
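For illustration, the join described above could be sketched in plain Python roughly as follows. This is only a minimal sketch under stated assumptions, not QIIME 2 API: the function name `relabel`, the sample rows, and the `MET…` identifiers are hypothetical, the files are assumed to have no header row, and the mapping is assumed to be 1-to-1.

```python
import csv
import io


def relabel(data_tsv: str, mapping_tsv: str) -> list[tuple[str, float]]:
    """Replace each InChI in the data table with its new ID (1-to-1 case)."""
    # Column 1 of the mapping file is the InChI, column 2 the new identifier.
    mapping = {row[0]: row[1]
               for row in csv.reader(io.StringIO(mapping_tsv), delimiter="\t")
               if row}
    mapped = []
    for row in csv.reader(io.StringIO(data_tsv), delimiter="\t"):
        if not row:
            continue
        inchi, value = row[0], float(row[1])
        if inchi in mapping:  # assumption: rows without a known ID are skipped
            mapped.append((mapping[inchi], value))
    return mapped


# Made-up example rows; real input would come from the two TSV files.
data = "InChI=1S/H2O/h1H2\t0.5\nInChI=1S/CH4/h1H4\t2.0\n"
ids = "InChI=1S/H2O/h1H2\tMET001\nInChI=1S/CH4/h1H4\tMET002\n"
print(relabel(data, ids))
```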

My original idea was to define two SemanticTypes and a transformer to produce the desired result as a third SemanticType.
(e.g. InChIDataValues + InChIMapping --> transformer --> NewID)

However, it seems that QIIME 2 Metadata can also be used for mapping purposes, and I have yet to see an example of a transformer that turns two inputs into one output.

What would be the ideal way to develop this mapping pipeline?

ebolyen · October 19, 2018, 9:06pm

Hi @zhaiman!

Sorry for the delayed response!

I was just talking to @mwang87 earlier this week about a similar situation in his plugin.

I guess I have a question to figure out what the best way forward might be:

In the mapping TSV #2, are the new IDs something a user provides or are they calculated by your plugin?

If they are user-defined, then I think Metadata is the way to go. This process sounds very similar to feature-table group which can be used to relabel IDs (if your groups happen to be a 1-to-1 map).

Otherwise, another avenue would be a Directory format which contains both of these files. They would be accessible as a single input for a transformer, because as you correctly observed, transformers are limited to single input and single output. Although I’m not sure how the transformer itself would fit into your overall pipeline here.

zhaiman · October 19, 2018, 10:59pm

Hi @ebolyen,

Thank you for the response!

At this point in time, the user will not need to provide any IDs. Most mappings are 1-to-1, but occasionally there may be one InChI mapping to multiple IDs.
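To make the occasional one-to-many case concrete, here is a small plain-Python sketch. It assumes, purely for illustration, that the numerical value is copied to every ID an InChI maps to and that InChIs without any ID are collected separately; the function name `expand_one_to_many` and the sample identifiers are hypothetical.

```python
from collections import defaultdict


def expand_one_to_many(values, pairs):
    """values: {inchi: number}; pairs: iterable of (inchi, new_id) tuples."""
    # Group all new IDs under the InChI they belong to.
    ids_by_inchi = defaultdict(list)
    for inchi, new_id in pairs:
        ids_by_inchi[inchi].append(new_id)

    mapped, unmapped = {}, []
    for inchi, value in values.items():
        if inchi not in ids_by_inchi:
            unmapped.append(inchi)  # no ID known for this InChI
            continue
        for new_id in ids_by_inchi[inchi]:
            mapped[new_id] = value  # assumption: value is copied to every ID
    return mapped, unmapped


# Made-up example: one InChI maps to two IDs, one has no mapping at all.
values = {"InChI=1S/H2O/h1H2": 0.5, "InChI=1S/CH4/h1H4": 2.0}
pairs = [("InChI=1S/H2O/h1H2", "MET001"), ("InChI=1S/H2O/h1H2", "MET002")]
mapped, unmapped = expand_one_to_many(values, pairs)
```

Whether copying the value is the right policy (as opposed to splitting it or picking one ID) depends on what the visualizer expects.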

If I were to use a Directory format that contains both TSV files, would the following approach work?

class UnmappedDirectoryFormat(model.DirectoryFormat):
    mapping = model.File(r'mapping_file.tsv', format=MapFormat)
    unmapped_data = model.File(r'unmapped_data.tsv', format=UserDataFormat)

MappedDirectoryFormat = model.SingleFileDirectoryFormat(
    'MappedDirectoryFormat', 'mapped_data.tsv', format=MappedDataFormat)

Which I would pass to a transformer:

def _1(data: UnmappedDirectoryFormat) -> MappedDirectoryFormat:
    '''Read unmapped data and then perform mapping operations to get new table mapped_data'''

    ff = MappedDirectoryFormat()
    with ff.open() as fh:
        mapped_data.write(fh, format=ReadyToUseData)
    return ff
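Outside the framework, the body of such a transformer boils down to reading both files from one directory and writing a single mapped file. The following is a minimal plain-Python sketch of that step only; it does not use the QIIME 2 API. The function name `map_directory` and the sample rows are hypothetical, the file names follow the formats above, and header-less, 1-to-1 TSV files are assumed.

```python
import csv
import tempfile
from pathlib import Path


def map_directory(in_dir: Path, out_path: Path) -> None:
    """Read both TSV files from in_dir and write the relabelled table."""
    with open(in_dir / "mapping_file.tsv", newline="") as fh:
        ids = {r[0]: r[1] for r in csv.reader(fh, delimiter="\t") if r}
    with open(in_dir / "unmapped_data.tsv", newline="") as src, \
            open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for r in csv.reader(src, delimiter="\t"):
            if r and r[0] in ids:  # assumption: unmapped rows are dropped
                writer.writerow([ids[r[0]], r[1]])


# Usage with made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "mapping_file.tsv").write_text("InChI=1S/H2O/h1H2\tMET001\n")
    (tmp_dir / "unmapped_data.tsv").write_text("InChI=1S/H2O/h1H2\t0.5\n")
    map_directory(tmp_dir, tmp_dir / "mapped_data.tsv")
    result = (tmp_dir / "mapped_data.tsv").read_text()
```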

ebolyen · October 25, 2018, 11:39pm

Hi @zhaiman,

So sorry for the delay!

Yes, what you have there is the essence of the idea! Let me know if you need any assistance (happy to schedule a video call offline), but I think you have the hang of it.

You can also directly access those properties on the UnmappedDirectoryFormat via data.mapping and data.unmapped_data, which have .view and .format methods/properties. .view will invoke transformers just like .write does, so you can subdivide your transformers if that is useful, though it’s probably too much work for just two TSV files.