Questions regarding ANCOM input and output

heimer · December 21, 2017, 12:55pm

Hello community, and thank you very much for these wonderful tools and discussions. I'm really learning a lot.
I have two questions regarding differential abundance analysis performed by ANCOM implemented in QIIME2.

I have a raw abundance table, which I summarized at the genus level using tax_glom() from phyloseq, so I now have a genus-level absolute abundance table. Is it safe to use this summarized table as input to ANCOM or DESeq2? My understanding is that these algorithms expect a "raw" count table; can the summarized table also be regarded as "raw"?
I ran the ANCOM analysis and got a percentile table as output. Can I extract the whole ANCOM-normalized table from the output? I also tried the ANCOM implementation in the Python library scikit-bio, but there does not seem to be a function for extracting it there either.

Sincerely,
heimer

colinbrislawn · December 21, 2017, 5:34pm

Good question! I would like to know the answer too.

mortonjt · December 31, 2017, 7:43am

Hi @heimer. Concerning your first question: yes, the percentiles are calculated on the raw counts. However, one should be careful about feeding in raw values, since ANCOM does not play nicely with zeros. Some filtering is necessary before running ANCOM to limit the number of low-abundance features being fed in. But ANCOM is fairly robust to library-size differences (see this paper), so rarefaction generally isn't necessary.
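
A minimal sketch of that kind of pre-filtering, in plain Python. The function names, the thresholds, and the toy table are all assumptions for illustration, not values recommended by ANCOM or QIIME 2. Adding a pseudocount is one common way to deal with the zeros that remain, and is likewise an assumption here.

```python
def filter_features(table, min_prevalence=0.1, min_total=10):
    """Keep features seen in enough samples and with enough total reads.

    `table` maps a feature ID to a list of raw counts (one per sample).
    Both thresholds are illustrative assumptions, not recommended values.
    """
    kept = {}
    for feature, counts in table.items():
        prevalence = sum(c > 0 for c in counts) / len(counts)
        if prevalence >= min_prevalence and sum(counts) >= min_total:
            kept[feature] = counts
    return kept


def add_pseudocount(table, pseudocount=1):
    """Add a constant to every count so that all log ratios are defined."""
    return {f: [c + pseudocount for c in counts] for f, counts in table.items()}


# Hypothetical genus-level raw counts for four samples.
raw = {
    "g__Bacteroides": [120, 98, 143, 110],
    "g__Prevotella": [0, 35, 0, 41],
    "g__Rare": [0, 0, 1, 0],
}
filtered = add_pseudocount(filter_features(raw, min_prevalence=0.5))
print(filtered)
```

Here the hypothetical "g__Rare" feature is dropped because it appears in only one of four samples, and the remaining zeros are shifted to 1 before any log ratio is taken.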

Concerning the second question:

Short answer: The ANCOM procedure doesn't produce intermediate normalized features.

Long answer: The ANCOM inference procedure is tightly coupled with a form of data normalization. The main idea behind ANCOM is that the log ratios automatically normalize for sequencing depth.

Think of it this way.

If there are D organisms, then for a given organism i, ANCOM will perform D-1 hypothesis tests. An example of one such test is as follows:

H_{0ij}: \mu_{ij,c_1} = \mu_{ij,c_2}

where c_1 and c_2 are the different classes being tested (e.g. treatment and control) and organism j is another organism that organism i is being compared to. The means are calculated as follows:

\mu_{ij, c_1} = \frac{1}{|c_1|}\sum\limits_{x \in c_1} \ln \frac{x_i}{x_j}

where x ranges over the samples in class c_1, x_i is the microbial count of organism i in sample x, and x_j is the microbial count of organism j in sample x.

One thing to note is that the log ratios themselves perform a type of normalization:

\mu_{ij, c_1} = \frac{1}{|c_1|}\sum\limits_{x \in c_1} \ln \frac{x_i}{x_j} = \frac{1}{|c_1|}\sum\limits_{x \in c_1} \ln \frac{n_xp_i}{n_xp_j} = \frac{1}{|c_1|}\sum\limits_{x \in c_1} \ln \frac{p_i}{p_j}

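Here, n_x is the sequencing depth of sample x, and p_i and p_j are the proportions of organisms i and j in that sample, so that x_i = n_x p_i. As you can see, the sequencing depth cancels out of every ratio. However, there is no intermediate normalized data, since the test statistics are calculated directly from the data.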
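
To make the cancellation concrete, here is a small check in plain Python. The counts are made up: the same hypothetical community is "sequenced" at two depths that differ by a factor of 100, and the log ratio between two organisms comes out the same.

```python
import math


def log_ratio(counts, i, j):
    """ln(x_i / x_j) for a single sample given as a list of counts."""
    return math.log(counts[i] / counts[j])


# Same underlying proportions (0.5, 0.3, 0.2), two sequencing depths.
shallow = [500, 300, 200]        # n_x = 1,000 reads
deep = [50000, 30000, 20000]     # n_x = 100,000 reads

lr_shallow = log_ratio(shallow, 0, 1)
lr_deep = log_ratio(deep, 0, 1)

# The raw counts differ 100-fold, but the log ratios agree.
print(lr_shallow, lr_deep)
same = math.isclose(lr_shallow, lr_deep)
```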

There are a couple of steps to aggregate the hypothesis tests into the W-statistic that is reported in the output. The details of the actual algorithm are located in the supplemental materials of the ANCOM paper.
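
As a rough illustration of the aggregation idea only, the sketch below counts, for each organism i, how many of the D-1 log-ratio tests reject the null. It is a simplification under stated assumptions: it swaps in an exact permutation test on the difference in means in place of the tests used by ANCOM itself, it skips multiple-testing correction and the final cutoff on W, and the function names and toy data are hypothetical. It assumes two classes and strictly positive counts.

```python
import math
from itertools import combinations


def permutation_pvalue(a, b):
    """Exact two-sided permutation test on the difference in means."""
    pooled = a + b
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(sum(a) / n_a - sum(b) / n_b)
    hits = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[k] for k in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
        n_splits += 1
    return hits / n_splits


def w_statistics(table, groups, alpha=0.05):
    """Count, per feature i, the features j whose log ratio with i differs by class.

    `table` maps feature -> counts per sample (all > 0);
    `groups` holds one class label per sample (exactly two classes).
    """
    labels = sorted(set(groups))
    c1 = [k for k, g in enumerate(groups) if g == labels[0]]
    c2 = [k for k, g in enumerate(groups) if g == labels[1]]
    w = {}
    for i in table:
        w[i] = 0
        for j in table:
            if i == j:
                continue
            lr = [math.log(xi / xj) for xi, xj in zip(table[i], table[j])]
            p = permutation_pvalue([lr[k] for k in c1], [lr[k] for k in c2])
            if p < alpha:
                w[i] += 1
    return w


# Toy data: organism A increases about 10-fold under treatment, the rest do not.
groups = ["control"] * 4 + ["treatment"] * 4
table = {
    "A": [10, 12, 11, 9, 100, 120, 110, 95],
    "B": [50, 55, 48, 52, 51, 49, 53, 50],
    "C": [30, 28, 33, 31, 29, 32, 30, 31],
    "D": [20, 22, 19, 21, 21, 20, 22, 19],
}
w = w_statistics(table, groups)
print(w)
```

In this toy table, A rejects against all three other organisms (W = 3), while B, C and D each reject only against A (W = 1), which is the pattern that singles out A as the one that changed.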
