Text handling within numeric types for q2-metadata


(Antonio Gonzalez Pena) #1

Issues:

  1. The International Nucleotide Sequence Database Collaboration (INSDC) has developed a standardised missing/null value reporting language, which says that missing values should be reported as one of: not applicable, not collected, not provided, restricted access. This means that a numeric column will contain text if any value is missing. In general this will be the case, because an experiment needs blanks and controls, which might not have a valid value for a given column; for example timepoint_numeric (a simple time point representation).
  2. The #q2:types and #q2:units directives are not part of the INSDC standard, so data stored there or downloaded from there will not have them, and adding them correctly might be a daunting task.

Possible solutions and comments:

  • q2-metadata is aware of these null values and ignores them when determining the correct #q2:types. The problem here is that the list of null values would have to be kept in sync if the standard changes.
  • q2-metadata only uses the values that q2-X (another plugin) is going to operate on. In a situation like the example in issue 1, we could compute the #q2:types on the subset of samples that q2-X will operate on; in other words, if my distance matrix doesn’t have any control/blank samples, it should compute that #q2:types is numeric. This means that plugins will need to figure out which samples are present in their input and pass that list to q2-metadata so it can do its magic.
  • Other?
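
To make the two options above concrete, here is a rough sketch in Python with pandas. It is not how q2-metadata works today; `infer_column_type`, its parameters, and the case-insensitive matching are all hypothetical and only illustrate the idea of (a) skipping the INSDC terms and (b) restricting inference to the samples a plugin will actually use.

```python
import pandas as pd

# The four INSDC missing-value terms listed in issue 1.
# Assumption: they are matched case-insensitively after trimming whitespace.
INSDC_MISSING = frozenset(
    {"not applicable", "not collected", "not provided", "restricted access"}
)


def infer_column_type(column, missing_terms=frozenset(), sample_ids=None):
    """Hypothetical helper: return 'numeric' or 'categorical' for one column.

    column        -- pandas Series of raw strings, indexed by sample ID
    missing_terms -- terms to ignore during inference (option 1)
    sample_ids    -- if given, only these samples are considered (option 2)
    """
    if sample_ids is not None:
        # Option 2: look only at the samples the plugin will operate on.
        column = column.loc[list(sample_ids)]

    # Option 1: drop values that belong to the missing-value vocabulary.
    normalized = column.astype(str).str.strip().str.lower()
    observed = column[~normalized.isin(missing_terms)]

    if observed.empty:
        # Nothing left to inspect; fall back to the safer type.
        return "categorical"

    # A column is numeric only if every remaining value parses as a number.
    parsed = pd.to_numeric(observed, errors="coerce")
    return "numeric" if parsed.notna().all() else "categorical"


timepoint = pd.Series(
    {"s1": "0", "s2": "7", "blank1": "not applicable"},
    name="timepoint_numeric",
)

naive = infer_column_type(timepoint)
aware = infer_column_type(timepoint, missing_terms=INSDC_MISSING)
subset = infer_column_type(timepoint, sample_ids=["s1", "s2"])
```

With this toy column, `naive` comes out categorical because of the blank, while both `aware` and `subset` come out numeric. The second option needs no vocabulary at all, but it pushes the job of collecting sample IDs onto every plugin.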

Sorry if confusing and/or long, just trying to be as clear as possible.


(Greg Caporaso) #2

@antgonza, we’re chatting about some ideas for how to support this use case. Will get back to you shortly on it.


(Evan Bolyen) #3

Hey @antgonza,

We all had a chance to chat about this yesterday, and I think we have a general strategy:

We add a new directive #q2:<some tag> which instructs QIIME 2 to use the INSDC vocabulary for missing values. We’re kind of thinking it would be easiest if this was a “global-to-the-file” directive so that you only have to add one little bit to the file, rather than adding this to each relevant column, but we could also do this per-column if that seems more useful.

We’d also expect this to work for categorical data as well.

Additionally, we’re thinking we should finish implementing an API for adding/changing directive annotations on metadata so that systems like Qiita can update the representation on the fly (when they know they are pulling from an EBI resource, for example).

Open questions:

  • What is the tag called and how do we make it look like a “global” directive? e.g. would something like this make sense: #q2:missing-vocabulary:INSDC (we’ve never had a directive take arguments, but this could make sense I think).
  • How do we represent these missing values once they have been loaded? Do we just convert them to NaN for the moment? How do these missing values impact downstream statistics? Is there anything smarter that can be done knowing the kind of missing value it is, or is NaN sufficient?
  • We’re assuming that adding an API for adding/removing these directives on a Metadata object (so that you could use it directly, or just .save() for the CLI) would be useful for Qiita, but let us know if that isn’t the case.
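
On the NaN question, one possibility is sketched below in Python with pandas: convert the INSDC terms to NaN for the numeric values, but keep the original term in a parallel column so the kind of missing value is not lost. This is only an illustration under assumptions; `split_missing` is a hypothetical name, not an existing QIIME 2 function, and the case-insensitive matching is a guess at how the vocabulary would be applied.

```python
import pandas as pd

# INSDC missing-value terms mentioned earlier in this thread.
INSDC_MISSING = frozenset(
    {"not applicable", "not collected", "not provided", "restricted access"}
)


def split_missing(column):
    """Hypothetical helper: split a raw column into values and missing reasons.

    Returns (values, reasons):
      values  -- float Series with NaN wherever an INSDC term appeared
      reasons -- Series holding the INSDC term for missing entries, NaN otherwise
    """
    normalized = column.astype(str).str.strip().str.lower()
    is_missing = normalized.isin(INSDC_MISSING)

    # Blank out the INSDC terms, then parse what is left as numbers.
    # Note: errors="coerce" would also turn any other non-numeric text into
    # NaN, so type inference should have confirmed the column is numeric first.
    values = pd.to_numeric(column.where(~is_missing), errors="coerce")

    # Keep the reason alongside, so it is still available downstream.
    reasons = normalized.where(is_missing)
    return values, reasons


timepoint = pd.Series(
    {"s1": "0", "s2": "7", "blank1": "not applicable"},
    name="timepoint_numeric",
)
values, reasons = split_missing(timepoint)

# pandas skips NaN by default, so the blank does not affect the mean.
mean_timepoint = values.mean()
```

Keeping `reasons` separate means plain NaN handling still works for statistics, while anything that cares about the difference between, say, "not applicable" and "not collected" can still look it up.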