Questions regarding ANCOM input and output

mortonjt · December 31, 2017, 7:43am

Hi @heimer. Concerning your first question, yes, the percentiles are calculated on the raw values. However, one should be careful about feeding in raw values, since ANCOM does not handle zeros well. Some filtering is necessary before running ANCOM to limit the number of low-abundance features that are being fed in. But ANCOM is fairly robust to library-size differences (see this paper), so rarefaction generally isn't necessary.
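To make the filtering step concrete, here is a minimal sketch in plain NumPy. The function name `filter_and_pseudocount`, the prevalence threshold, and the use of a pseudocount to remove the remaining zeros are all assumptions for illustration, not ANCOM defaults or part of its API.

```python
import numpy as np

def filter_and_pseudocount(counts, min_prevalence=0.25, pseudocount=1.0):
    """Hypothetical helper: counts is a (samples x features) array of raw counts.

    Keeps features that are nonzero in at least `min_prevalence` of the samples,
    then adds `pseudocount` to every remaining entry so that log ratios are finite.
    Both thresholds are illustrative assumptions.
    """
    counts = np.asarray(counts, dtype=float)
    prevalence = (counts > 0).mean(axis=0)  # fraction of samples containing each feature
    keep = prevalence >= min_prevalence
    return counts[:, keep] + pseudocount, keep

# Toy table: 4 samples, 4 features; features 1 and 3 are almost always zero.
table = np.array([
    [10, 0, 5, 0],
    [20, 0, 8, 1],
    [15, 0, 0, 0],
    [30, 2, 9, 0],
])
filtered, keep = filter_and_pseudocount(table, min_prevalence=0.5)
```

With this toy table, only the first and third features survive the filter, and the remaining zero in the third feature is replaced by the pseudocount.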

Concerning the second question:

Short answer: The ANCOM procedure doesn't produce intermediate normalized features.

Long answer: The ANCOM inference procedure is tightly coupled with a form of data normalization. The main idea behind ANCOM is that the log ratios automatically normalize for sequencing depth.

Think of it this way.

If there are D organisms, then for a given organism i, ANCOM will perform D-1 hypothesis tests. An example of such a hypothesis test is as follows:

H_{0ij}: \mu_{ij,c_1} = \mu_{ij,c_2}

where c_1 and c_2 are the classes being compared (e.g. treatment and control) and organism j is another organism that organism i is being compared against. The means are calculated as follows:

\mu_{ij, c_1} = \frac{1}{|c_1|}\sum\limits_{x \in c_1} \ln \frac{x_i}{x_j}

where x ranges over the samples in class c_1, x_i is the count of organism i in sample x, and x_j is the count of organism j in sample x.
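As a rough illustration of the formula above, the mean log ratio for one class can be computed directly from a count table. This is only a sketch: `mean_log_ratio` and the toy array `c1` are made-up names, and the counts are assumed to contain no zeros.

```python
import numpy as np

def mean_log_ratio(counts, i, j):
    """Mean of ln(x_i / x_j) over the samples (rows) of one class.

    counts: (samples x organisms) array of strictly positive counts.
    """
    counts = np.asarray(counts, dtype=float)
    return np.mean(np.log(counts[:, i] / counts[:, j]))

# Two samples from a hypothetical class c_1, three organisms.
c1 = np.array([
    [10, 5, 20],
    [40, 10, 80],
])

# mu_{ij, c_1} for i = 0, j = 1: the average of ln(10/5) and ln(40/10).
mu_01 = mean_log_ratio(c1, 0, 1)

# For a fixed organism i there are D-1 such means, one per other organism j.
i = 0
mus_for_i = {j: mean_log_ratio(c1, i, j) for j in range(c1.shape[1]) if j != i}
```

The same function applied to the samples of c_2 gives the other side of each hypothesis.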

One thing to note is that the log ratios themselves perform a type of normalization:

\mu_{ij, c_1} = \frac{1}{|c_1|}\sum\limits_{x \in c_1} \ln \frac{x_i}{x_j} = \frac{1}{|c_1|}\sum\limits_{x \in c_1} \ln \frac{n_xp_i}{n_xp_j} = \frac{1}{|c_1|}\sum\limits_{x \in c_1} \ln \frac{p_i}{p_j}

Here, n_x is the sequencing depth for sample x and p_i is the proportion of organism i in that sample, so x_i = n_x p_i. As you can see, the sequencing depth drops out. However, there is no intermediate normalized data, since the test statistics are calculated directly from the data.
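A quick numerical check of this cancellation, under the idealised assumption that counts are exactly depth times proportion (no sampling noise); the proportions and depths below are made up:

```python
import numpy as np

# Assumed true proportions of three organisms, shared by both samples.
p = np.array([0.5, 0.3, 0.2])

# Two samples sequenced at very different depths n_x.
depths = np.array([1000, 50000])

# Idealised counts: x_i = n_x * p_i for each sample x.
counts = depths[:, None] * p[None, :]

# ln(x_0 / x_1) per sample; n_x cancels, leaving ln(p_0 / p_1) in both rows.
log_ratio = np.log(counts[:, 0] / counts[:, 1])
```

Both entries of `log_ratio` equal ln(0.5 / 0.3) even though the depths differ by a factor of 50.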

There are a couple of steps to aggregate the hypothesis tests into the W-statistic that is reported in the output. The details of the actual algorithm are in the supplemental materials of the ANCOM paper.