Best way to merge or group runs/samples

I am currently working to combine two Illumina MiSeq runs for the same samples to obtain higher reads for further analysis and am confused about which functions within the feature-table merge or feature-table group plugins might be best.

Ideally, I would like to combine both runs and then group my variables into meaningful categories. I am looking at environmental microbial ecology and as an example I would like to combine leaf sample 1 run1 and leaf sample 2 run 2 then group all of my leaf samples together as one variable, leaf. Also, I have triplicates for some of my samples (i.e., leaf1-1, leaf1-2, leaf1-2; run 1) and it has been suggested to me that I should average these triplicates.

For feature-table merge, can you expand on what exactly is occurring with each of the --p-overlap methods; sum, error_on_overlapping_sample, and error_on_overlapping_feature? And could you do the same for the --p-mode functions in feature-table group; sum, mean-ceiling, median-ceiling?

I’ve looked at the table outputs for multiple of these methods and I am still unsure of how those sequence count numbers are actually computed. Any insight into these functions and into the best methods for combining my run data is greatly appreciated!

Thanks!

Hello again, Sarah. I hope your 2018 is off to a good start.

Like I said before, when I combine samples from two miseq runs, I will keep them separate so that I can include a metadata variable like MiSeqRunNumber and use it to detect batch effects. So I would demultiplex these all as separate samples, and process them in a single, unified batch to make a single feature abundance table. At the end, I could combine them into the categories you described using this Qiime 2 plugin:
https://docs.qiime2.org/2017.12/plugins/available/feature-table/group/
This page also describes --p-mode option you were asking about.


The feature-table merge plugin is needed when you have processed two batches of samples into two tables, and now you want to combine these tables. This is sort of risky because if you have two different samples that happen to have the same name, this plugin could merge them by mistake! To prevent this problem, this plugin starts by checking to see if any of the sample names or feature names overlap. If it finds overlapping names, you can choose what it does next (like sum these two samples, or throw an error about the identical name).


Let's zoom out a bit. You have lots of samples, and you want to group them in different ways. The major choice here is to either:

  1. process them all in one batch, then make meaningful categories using feature-table group or
  2. process them in many batches (one for each meaningful category), then feature-table merge when you want to see the full study.

2024 update:

Both options are okay, but one works better for DADA2!

Because DADA2 builds an error model for each sequencing run, you should run DADA2 on each sequencing run separately, then merge the tables like in option 2.

If you want to 'collapse/merge/sum' replicates later on, you can use option 1.

They serve different purposes so Qiime2 supports both! :qiime2:

2 Likes

Hi Colin,

My 2018 is starting off great (just QIIME2ing away :slight_smile: ), I hope yours is!

Thanks for your response, but I still have some questions about the functions I mentioned before.

In terms of the mean and median ceiling functions, does that provide the maximum mean or median for sequence counts found in a certain sample? What would be the benefit of using one of these functions?

For the sum function, that seemed to have just added together the sequence counts for samples from each of my runs or categories, but wouldn't that overestimate the features of a sample if they are just added together? Say I want to group all of my adult frog samples together and I have three adults whose same samples were run on two runs. I add those together by grouping them using the sum function. Would that result in possibly assigning a higher abundance of a certain feature due to it being found in each sample?

For feature-table merge..

does this mean that the error_on_overlapping_feature function would simply output an error if you have features from each sample that overlap?

Thanks for outlining the workflow for merging or grouping samples together. I now am more confident in the choice of proceeding with option 1.

Thanks!
-Sarah

Hi @Sarah_McGrath! To answer your questions about what the different parameter choices for merge and group mean, I think some examples might help illustrate the concepts:

merge

from biom import Table
import numpy as np
from qiime2 import Artifact
from qiime2.plugins import feature_table

t1 = Artifact.import_data('FeatureTable[Frequency]',
                          Table(np.array([[0, 1, 3], [1, 1, 2]]),
                                         ['O1', 'O2'], ['S1', 'S2', 'S3']))
t2 = Artifact.import_data('FeatureTable[Frequency]',
                          Table(np.array([[0, 2, 6], [2, 2, 4]]),
                                ['O1', 'O3'], ['S1', 'S5', 'S6']))

Those two tables look like this:

# Constructed from biom file
#OTU ID	S1	S2	S3
O1	0.0	1.0	3.0
O2	1.0	1.0	2.0
# Constructed from biom file
#OTU ID	S1	S5	S6
O1	0.0	2.0	6.0
O3	2.0	2.0	4.0

Above, we create two FeatureTables, note that each table has an S1 sample, and an O1 feature present.

feature_table.methods.merge([t1, t2], overlap_method='error_on_overlapping_sample')
...
ValueError: Same samples are present in some of the provided tables: S1

error_on_overlapping_sample is complaining about the duplicate sample in both tables, S1.

feature_table.methods.merge([t1, t2], overlap_method='error_on_overlapping_feature')
...
ValueError: Same features are present in some of the provided tables: O1

error_on_overlapping_feature is complaining about the duplicate feature in both tables, O1.

t3, = feature_table.methods.merge([t1, t2], overlap_method='sum')
print(t3.view(Table))

# Constructed from biom file
#OTU ID	S1	S2	S3	S5	S6
O1	0.0	1.0	3.0	2.0	6.0
O2	1.0	1.0	2.0	0.0	0.0
O3	2.0	0.0	0.0	2.0	4.0

The merging doesn’t complain about the overlapping sample or feature from above, but rather sums the values anywhere that there is an overlap.


group

For grouping, taking the ceiling of a value means to round up. So, when you group on a metadata value, and select something like median or mean, you might wind up with a non-whole number, which doesn’t really make sense when considering the nature of an observation matrix. The ceiling means that after those values are computed (median; mean), we round the value up to the nearest whole number. We don’t need to worry about rounding when performing an operation like sum, because that will always result in a whole number.

import qiime2
import pandas as pd
import biom

sample_mc = qiime2.CategoricalMetadataColumn(pd.Series(['x', 'y', 'y'], name='foo',
                                             index=pd.Index(['a', 'b', 'c'], name='sampleid')))
table = qiime2.Artifact.import_data('FeatureTable[Frequency]',
                                    biom.Table(np.array([[1, 2, 3], [30, 20, 10]]),
                                               sample_ids=sample_mc.to_series().index,
                                               observation_ids=['O1', 'O2']))
# Constructed from biom file
#OTU ID	a	b	c
O1	1.0	2.0	3.0
O2	30.0	20.0	10.0
t_sum, = feature_table.methods.group(table=table, axis='sample', metadata=sample_mc, mode='sum')
print(t_sum.view(biom.Table))

# Constructed from biom file
#OTU ID	x	y
O1	1.0	5.0
O2	30.0	30.0
t_median, = feature_table.methods.group(table=table, axis='sample', metadata=sample_mc, mode='median-ceiling')
print(t_median.view(biom.Table))

# Constructed from biom file
#OTU ID	x	y
O1	1.0	3.0
O2	30.0	15.0
t_mean, = feature_table.methods.group(table=table, axis='sample', metadata=sample_mc, mode='mean-ceiling')
print(t_mean.view(biom.Table))

# Constructed from biom file
#OTU ID	x	y
O1	1.0	3.0
O2	30.0	15.0

It looks like you asked some additional questions while I was writing this post - can you take a look at this, and follow up with any remaining questions, restated? You can just copy-and-paste. Thanks! :t_rex:

2 Likes

Hi @thermokarst!

Thanks for the examples! That really helps to see what each function does.

Follow up question:
For the sum function, that seemed to have just added together the sequence counts for samples from each of my runs or categories, but wouldn’t that overestimate the features of a sample if they are just added together? Say I want to group all of my adult frog samples together and I have three adults whose same samples were run on two runs. I add those together by grouping them using the sum function. Would that result in possibly assigning a higher abundance of a certain feature due to it being found in each sample?

In short, are there any issues with downstream analysis when summing sample sequence counts?

Thanks,
-Sarah

Hi @Sarah_McGrath,

This would only worry me if you suspect any major batch effects (as @colinbrislawn raises), or if those runs were processed with different protocols, such that you will have different sequence variants in each run. Otherwise, summing should not have a major impact on diversity metrics — which will be normalized, e.g., by even sampling, prior to estimation — on taxonomic composition — which will rely on relative abundances — or on differential abundance analyses like ANCOM, in which features are also normalized prior to testing.

I hope that helps!

2 Likes

Hi @Nicholas_Bokulich.

Thanks for clarifying. I used the same protocol for each run and do not suspect major batch effects. My first run just got lower sequence counts than I would have liked so I ran it again to obtain higher sequence numbers.

Could you provide an example in which it would be better to use the mean and median ceiling? What benefit would that provide over summing?

Thanks,
-Sarah

To be honest, I'm not certain. Summing seems to make the most sense to me for standard uses, since the data will be normalized downstream anyway (e.g., even rarefaction prior to alpha diversity analysis).

I suppose if you suspected that certain samples within a group are outliers that would skew the mean, you might want to use median-ceiling instead to help control that.

Perhaps mean/median would also make the most sense if the frequencies will be used directly without normalization. For example, if a feature table contained quantitatively informative frequencies (e.g., metabolite concentrations or microbe abundances calculated using an absolutely quantitative method). Then summing those features would not make sense if you want to then look at the mean/median abundance of features in each sample type. But this would not really concern most current uses with microbiome sequencing data.

3 Likes

Thank you for your response!

2 Likes

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.