Best way to merge or group runs/samples

thermokarst · February 5, 2018, 5:08pm

Hi @Sarah_McGrath! To answer your questions about what the different parameter choices for merge and group mean, I think some examples might help illustrate the concepts:

merge

from biom import Table
import numpy as np
from qiime2 import Artifact
from qiime2.plugins import feature_table

t1 = Artifact.import_data('FeatureTable[Frequency]',
                          Table(np.array([[0, 1, 3], [1, 1, 2]]),
                                ['O1', 'O2'], ['S1', 'S2', 'S3']))
t2 = Artifact.import_data('FeatureTable[Frequency]',
                          Table(np.array([[0, 2, 6], [2, 2, 4]]),
                                ['O1', 'O3'], ['S1', 'S5', 'S6']))

Those two tables look like this:

# Constructed from biom file
#OTU ID	S1	S2	S3
O1	0.0	1.0	3.0
O2	1.0	1.0	2.0
# Constructed from biom file
#OTU ID	S1	S5	S6
O1	0.0	2.0	6.0
O3	2.0	2.0	4.0

Above, we create two FeatureTables. Note that both tables contain a sample named S1 and a feature named O1.

feature_table.methods.merge([t1, t2], overlap_method='error_on_overlapping_sample')
...
ValueError: Same samples are present in some of the provided tables: S1

error_on_overlapping_sample raises an error because sample S1 appears in both tables.

feature_table.methods.merge([t1, t2], overlap_method='error_on_overlapping_feature')
...
ValueError: Same features are present in some of the provided tables: O1

error_on_overlapping_feature raises an error because feature O1 appears in both tables.
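
Conceptually, both checks look for IDs shared between the tables on one axis. Here is a rough sketch in plain Python, assuming the check behaves like a set intersection; it only illustrates the idea and is not the plugin's actual code:

```python
# Sample and feature IDs of the two example tables, written out by hand.
t1_samples, t1_features = {'S1', 'S2', 'S3'}, {'O1', 'O2'}
t2_samples, t2_features = {'S1', 'S5', 'S6'}, {'O1', 'O3'}

# IDs present in both tables on each axis.
overlapping_samples = sorted(t1_samples & t2_samples)
overlapping_features = sorted(t1_features & t2_features)

print(overlapping_samples)   # ['S1']
print(overlapping_features)  # ['O1']
```

If either list is non-empty, the corresponding error_on_overlapping_* option refuses to merge.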

t3, = feature_table.methods.merge([t1, t2], overlap_method='sum')
print(t3.view(Table))

# Constructed from biom file
#OTU ID	S1	S2	S3	S5	S6
O1	0.0	1.0	3.0	2.0	6.0
O2	1.0	1.0	2.0	0.0	0.0
O3	2.0	0.0	0.0	2.0	4.0

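With sum, the merge doesn't complain about the overlapping sample or feature from above; instead it sums the values wherever the tables overlap. Here the only cell present in both tables is O1 in S1, which is 0 in both, so it stays 0. Feature/sample combinations that appear in neither table (like O3 in S2) are filled with 0.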
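
If it helps to see the arithmetic without QIIME 2, here is a minimal sketch of the same idea. The nested-dict layout and the merge_sum helper are made up for this illustration (they are not part of biom or QIIME 2), and I'm assuming a missing cell counts as 0:

```python
from collections import defaultdict

# Each table is modeled as {feature_id: {sample_id: count}}.
# This is a hypothetical stand-in for a FeatureTable, not the biom API.
t1 = {'O1': {'S1': 0, 'S2': 1, 'S3': 3},
      'O2': {'S1': 1, 'S2': 1, 'S3': 2}}
t2 = {'O1': {'S1': 0, 'S5': 2, 'S6': 6},
      'O3': {'S1': 2, 'S5': 2, 'S6': 4}}


def merge_sum(*tables):
    """Add counts cell by cell across all tables (hypothetical helper)."""
    merged = defaultdict(lambda: defaultdict(int))
    for table in tables:
        for feature, samples in table.items():
            for sample, count in samples.items():
                merged[feature][sample] += count
    return {feature: dict(samples) for feature, samples in merged.items()}


merged = merge_sum(t1, t2)
print(merged['O1'])               # {'S1': 0, 'S2': 1, 'S3': 3, 'S5': 2, 'S6': 6}
print(merged['O3'].get('S2', 0))  # 0, because O3 was never observed in S2
```

If O1 had been, say, 4 in S1 of the first table and 5 in the second, the merged cell would be 9.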

group

For grouping, taking the ceiling of a value means rounding it up. When you group on a metadata value and select median or mean, you might wind up with a non-whole number, which doesn't really make sense in a table of observation counts. The ceiling modes compute the median or mean first, then round the result up to the nearest whole number. We don't need to worry about rounding with sum, because summing whole-number counts always gives a whole number.

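import biom
import numpy as np  # already imported above; repeated so this example runs on its own
import pandas as pd
import qiime2
from qiime2.plugins import feature_table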

sample_mc = qiime2.CategoricalMetadataColumn(
    pd.Series(['x', 'y', 'y'], name='foo',
              index=pd.Index(['a', 'b', 'c'], name='sampleid')))
table = qiime2.Artifact.import_data('FeatureTable[Frequency]',
                                    biom.Table(np.array([[1, 2, 3], [30, 20, 10]]),
                                               sample_ids=sample_mc.to_series().index,
                                               observation_ids=['O1', 'O2']))

That table looks like this (samples b and c both have the metadata value y):

# Constructed from biom file
#OTU ID	a	b	c
O1	1.0	2.0	3.0
O2	30.0	20.0	10.0

t_sum, = feature_table.methods.group(table=table, axis='sample', metadata=sample_mc, mode='sum')
print(t_sum.view(biom.Table))

# Constructed from biom file
#OTU ID	x	y
O1	1.0	5.0
O2	30.0	30.0

t_median, = feature_table.methods.group(table=table, axis='sample', metadata=sample_mc, mode='median-ceiling')
print(t_median.view(biom.Table))

# Constructed from biom file
#OTU ID	x	y
O1	1.0	3.0
O2	30.0	15.0

t_mean, = feature_table.methods.group(table=table, axis='sample', metadata=sample_mc, mode='mean-ceiling')
print(t_mean.view(biom.Table))

# Constructed from biom file
#OTU ID	x	y
O1	1.0	3.0
O2	30.0	15.0
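
To see where the y column comes from in each mode, here is the same arithmetic done by hand with only the standard library. This is just a worked check of the numbers above, not how the plugin computes them internally:

```python
import math
from statistics import mean, median

# Counts for samples b and c, the two samples with metadata value 'y'.
y_counts = {'O1': [2, 3], 'O2': [20, 10]}

results = {}
for feature, counts in y_counts.items():
    results[feature] = {
        'sum': sum(counts),
        'median-ceiling': math.ceil(median(counts)),  # 2.5 rounds up to 3
        'mean-ceiling': math.ceil(mean(counts)),
    }
    print(feature, results[feature])
```

For O1 the median and mean of 2 and 3 are both 2.5, so the ceiling gives 3. For O2 they are both exactly 15, so there is nothing to round. Group x contains only sample a, so its values are unchanged in every mode.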

It looks like you asked some additional questions while I was writing this post - can you take a look at this, and follow up with any remaining questions, restated? You can just copy-and-paste. Thanks!