Semantic type for model matrix

dgiguer · October 21, 2019, 7:51pm

Hi all,

I'm working on adding glm support to the ALDEx2 plugin. The glm function requires a model matrix as input, which looks something like this (default output from R; it may have more columns and/or rows):

(Intercept)	A	B
1	0	0
1	1	0
1	0	0
1	0	0
1	0	0
1	1	0
1	0	0
1	1	1
1	1	1
1	1	1
1	1	1
1	1	1
1	0	1
1	0	1

I'm wondering whether a semantic type already exists that supports this kind of data, or whether I would need to create a new one.

Cheers,

Dan

ebolyen · October 21, 2019, 8:34pm

Hey @dgiguer!

We were just talking about this the other day! Well, almost: we were discussing whether to make formulas (Wilkinson-Rogers/"R" style) a primitive type in QIIME 2 (via something like patsy). We were also thinking that once a formula is "bound" to a scope (e.g. one or more metadata files/artifacts), it should be possible to view it directly as a model matrix.

I believe a few actions take this formula approach. Would a formula work for ALDEx2, or are there typical model matrices that don't have an equivalent representation in formula notation?

dgiguer · October 21, 2019, 9:18pm

Thanks for the quick reply @ebolyen!

I think a formula would work as long as the model matrix could be viewed directly and stored as a temp file itself. Currently for ALDEx2, I would need to save the matrix as a file to be read into R in the background.

Do you happen to know any examples of the formula being used for this purpose?

ebolyen · October 21, 2019, 9:31pm

Hey @dgiguer,

I'm not aware of anyone doing this specifically, but I think the following should do the trick:

In [1]: import pandas as pd

In [2]: import patsy

In [3]: df = pd.DataFrame(patsy.demo_data('a', 'b', 'y'))

In [4]: df
Out[4]:
    a   b         y
0  a1  b1  1.764052
1  a1  b2  0.400157
2  a2  b1  0.978738
3  a2  b2  2.240893
4  a1  b1  1.867558
5  a1  b2 -0.977278
6  a2  b1  0.950088
7  a2  b2 -0.151357

In [5]: patsy.dmatrix('a + b', df, return_type='dataframe')
Out[5]:
   Intercept  a[T.a2]  b[T.b2]
0        1.0      0.0      0.0
1        1.0      0.0      1.0
2        1.0      1.0      0.0
3        1.0      1.0      1.0
4        1.0      0.0      0.0
5        1.0      0.0      1.0
6        1.0      1.0      0.0
7        1.0      1.0      1.0

Once you have the design matrix as a dataframe, you can pick your favorite serialization scheme for pulling it into R.
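
As a minimal sketch of one such scheme (an assumption, not something ALDEx2 or QIIME 2 prescribes), the design matrix could be written to a temporary tab-separated file and the path handed to the R script. The choice of TSV, the `.tsv` suffix, and the variable names `design` and `path` are illustrative only; the read-back at the end is just a sanity check.

```python
import os
import tempfile

import pandas as pd
import patsy

# Same demo data and formula as in the session above.
df = pd.DataFrame(patsy.demo_data('a', 'b', 'y'))
design = patsy.dmatrix('a + b', df, return_type='dataframe')

# Create a temp file and close the handle so pandas can write to the path.
fd, path = tempfile.mkstemp(suffix='.tsv')
os.close(fd)

# Drop the pandas row index; R assigns its own row names on read.
design.to_csv(path, sep='\t', index=False)

# Sanity check: read the file back and compare it with the original matrix.
roundtrip = pd.read_csv(path, sep='\t')
same_values = bool((roundtrip.values == design.values).all())

# Clean up (in real use, remove the file only after R has read it).
os.remove(path)
```

On the R side, something like `as.matrix(read.delim(path, check.names = FALSE))` should recover a numeric matrix; `check.names = FALSE` keeps column names such as `a[T.a2]` intact. Note that patsy names the first column `Intercept`, whereas R's own model matrices use `(Intercept)`, so rename it if the downstream code looks the column up by name.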