[q2-metadata] how to use default_missing_scheme and column_missing_schemes

Good morning!

I'm using

In [1]: import qiime2

In [2]: qiime2.__version__
Out[2]: '2022.8.3'

and I'm having trouble understanding how the default_missing_scheme and column_missing_schemes parameters work in the qiime2.Metadata.load method. I was using the Metadata page of the QIIME 2 Developer Documentation as a reference, and the file I'm using to test this functionality is example.txt (232 Bytes).

As expected, when using the default, everything is categorical:

In [3]: qiime2.Metadata.load('example.txt')
Out[3]: 
Metadata
--------
6 IDs x 4 columns
categorical: ColumnProperties(type='categorical', missing_scheme='blank')
numeric:     ColumnProperties(type='categorical', missing_scheme='blank')
mixed:       ColumnProperties(type='categorical', missing_scheme='blank')
other:       ColumnProperties(type='categorical', missing_scheme='blank')

Call to_dataframe() for a tabular representation.

but I expected that default_missing_scheme='INSDC:missing' would treat cells containing "not provided" as missing, so that columns whose remaining values are all numeric would be autodetected as numeric. That doesn't happen:

In [4]: qiime2.Metadata.load('example.txt', default_missing_scheme='INSDC:missing')
Out[4]: 
Metadata
--------
6 IDs x 4 columns
categorical: ColumnProperties(type='categorical', missing_scheme='INSDC:missing')
numeric:     ColumnProperties(type='categorical', missing_scheme='INSDC:missing')
mixed:       ColumnProperties(type='categorical', missing_scheme='INSDC:missing')
other:       ColumnProperties(type='categorical', missing_scheme='INSDC:missing')

Call to_dataframe() for a tabular representation.

I tried using column_missing_schemes, but I'm not sure whether it should be combined with default_missing_scheme or used on its own.

Any guidance will be appreciated.

Thank you.


Hey @antgonza!

I think your expectation is correct: those should have become numeric automatically. I'm going to double-check things on our end (I was pretty sure we tested exactly this), but in the meantime could you send a representative file with this situation? (Sorry, never mind, I'll check your example. Thanks for providing it!)

Re: whether column_missing_schemes should be combined with default_missing_scheme:

Yep! Those are designed to be used together: default_missing_scheme is the fallback for any column not explicitly mentioned in column_missing_schemes.

The precedence order (from greatest to least) should be:

column_missing_schemes > in-file q2:missing annotation > default_missing_scheme
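
To make that ordering concrete, here is a minimal sketch of how a scheme could be picked for a single column. This is not QIIME 2's actual code: resolve_missing_scheme and its file_missing argument (standing in for whatever q2:missing annotations were read from the file) are hypothetical names used only for illustration.

```python
from typing import Dict, Optional


def resolve_missing_scheme(
    column: str,
    column_missing_schemes: Optional[Dict[str, str]] = None,
    file_missing: Optional[Dict[str, str]] = None,
    default_missing_scheme: str = "blank",
) -> str:
    """Hypothetical helper: choose one column's scheme by precedence."""
    # 1. An explicit per-column argument wins over everything else.
    if column_missing_schemes and column in column_missing_schemes:
        return column_missing_schemes[column]
    # 2. Otherwise use the in-file q2:missing annotation, if the column has one.
    if file_missing and column in file_missing:
        return file_missing[column]
    # 3. Otherwise fall back to the default scheme.
    return default_missing_scheme
```

With this sketch, a column listed in both column_missing_schemes and the file annotation takes the value from column_missing_schemes, and a column listed in neither takes default_missing_scheme.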

I'll report back with what I find soon.

EDIT:
missed the example file somehow. Sorry!

Yeah, this appears to be a bug in the precedence order.

Setting the q2:missing in the file works as expected, but you should have been able to override the interpretation with default_missing_scheme. I'm going to look into the logic a bit more to see if this is quick to fix.

Alright, this will be easy to fix.

There is logic for handling the cast of the column with knowledge of the missing scheme. It just turns out that default_missing_scheme was left out of the party.
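
As a rough illustration of why the missing scheme matters for the cast, the sketch below infers a column type from raw cell strings. It is plain Python, not the QIIME 2 implementation: cast_column is a hypothetical function, and the term list is an assumed vocabulary for 'INSDC:missing' (only "not provided" is mentioned in this thread).

```python
import math
from typing import List, Sequence, Tuple, Union

# Assumed vocabulary, for illustration only.
ASSUMED_INSDC_TERMS = {
    "not applicable", "missing", "not collected",
    "not provided", "restricted access",
}


def cast_column(
    values: Sequence[str], missing_scheme: str = "blank"
) -> Tuple[str, List[Union[str, float]]]:
    """Return ('numeric', floats) if every non-missing cell parses as a
    number; otherwise return ('categorical', original cells)."""
    missing_terms = {""}  # blank cells count as missing under either scheme
    if missing_scheme == "INSDC:missing":
        missing_terms |= ASSUMED_INSDC_TERMS
    parsed: List[Union[str, float]] = []
    for cell in values:
        if cell.strip() in missing_terms:
            parsed.append(math.nan)  # missing cells do not block the cast
            continue
        try:
            parsed.append(float(cell))
        except ValueError:
            # One non-numeric, non-missing cell makes the column categorical.
            return "categorical", list(values)
    return "numeric", parsed


cells = ["1.5", "not provided", "3", ""]
print(cast_column(cells, "blank")[0])          # categorical
print(cast_column(cells, "INSDC:missing")[0])  # numeric
```

Under 'blank', the "not provided" cell is an ordinary string, so the column stays categorical. Under 'INSDC:missing' it is treated as missing and the remaining cells cast cleanly.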

That means that, for now, the following hack will work (unfortunately, it involves a double load):

In [3]: cols = Metadata.load('Downloads/example.txt').columns

In [4]: Metadata.load('Downloads/example.txt', 
                      column_missing_schemes={
                          c: 'INSDC:missing' for c in cols})
Out[4]: 
Metadata
--------
6 IDs x 4 columns
categorical: ColumnProperties(type='categorical', missing_scheme='INSDC:missing')
numeric:     ColumnProperties(type='numeric', missing_scheme='INSDC:missing')
mixed:       ColumnProperties(type='categorical', missing_scheme='INSDC:missing')
other:       ColumnProperties(type='categorical', missing_scheme='INSDC:missing')
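
If you need this in more than one place, the double load could be wrapped in a small helper. This is only a sketch: schemes_for_all_columns and load_with_scheme are hypothetical names, and the only QIIME 2 calls used are the ones shown above (Metadata.load, its column_missing_schemes parameter, and .columns).

```python
from typing import Dict, Iterable


def schemes_for_all_columns(columns: Iterable[str], scheme: str) -> Dict[str, str]:
    """Map every column name to the same missing scheme."""
    return {name: scheme for name in columns}


def load_with_scheme(path: str, scheme: str = "INSDC:missing"):
    """Hypothetical wrapper around the double-load workaround."""
    # Imported lazily so the helper above can be used without QIIME 2.
    from qiime2 import Metadata

    # First load: only used to discover the column names.
    columns = Metadata.load(path).columns
    # Second load: apply the scheme to every column explicitly.
    return Metadata.load(
        path,
        column_missing_schemes=schemes_for_all_columns(columns, scheme),
    )
```

Calling load_with_scheme('Downloads/example.txt') should then be equivalent to the two steps shown above.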

Thanks @ebolyen, I'll do that!

Sounds good!

I've created a PR fixing this bug:

So you should be able to drop that hack come next release as the default_missing_scheme parameter will behave correctly.
