This may end up being moved to General Discussion, but I had a question regarding log transformations performed by the ANCOM plugin. Are these transformations truly centered log ratio transformations (clr) or are they additive log ratios (alr)? If I understand correctly, clr uses the geometric mean as the reference and alr uses a particular component as a reference.
I've come across papers stating that ANCOM performs alr transformations (Gloor et al., Nearing et al., Hu et al.), whereas differential abundance tests like ALDEx and ALDEx2 use clr. However, the QIIME2 Parkinson's tutorial specifically states the centered log transformation is used with ANCOM, as do some forum posts like this one, and the ANCOM plugin "--p-transform-function" text options include "clr" but not "alr". The original paper by Mandal et al. does not explicitly say centered log transformations were used, but does mention compositional log transformations, which would also be abbreviated clr...
I'm far from a statistician or mathematician, but would like to better understand this discrepancy. Could anyone provide some clarity on what's actually being performed with the plugin?
This is true: compositional data analysis is quite messy, and statisticians tend to use inconsistent terminology for the operations they perform.
ALR basically utilizes one feature (sequence) as a reference frame. It is a common practice in RNA-Seq, because we can select consistently expressed genes called "housekeeping genes".
In the microbiome, this will not work because we don't have information about microbes with a stable abundance anywhere. Therefore it's hard to select a meaningful reference.
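To make the difference concrete, here is a minimal sketch of both transforms in plain Python. The function names, the toy sample, and the choice of the last feature as the ALR reference are all assumptions for illustration, not what the plugin does internally. It also assumes every value is strictly positive (e.g. a pseudocount was already added).

```python
import math

def clr(composition):
    """Centered log ratio: log of each part minus the mean of the logs,
    i.e. each part relative to the geometric mean of the sample."""
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def alr(composition, reference_index):
    """Additive log ratio: log of each part relative to one chosen
    reference part. The reference itself is dropped from the output."""
    ref = math.log(composition[reference_index])
    return [math.log(x) - ref
            for i, x in enumerate(composition)
            if i != reference_index]

# Hypothetical sample with three features (assumed strictly positive).
sample = [10.0, 20.0, 70.0]
clr_values = clr(sample)                      # one value per feature, sums to ~0
alr_values = alr(sample, reference_index=2)   # one value fewer than features
print(clr_values)
print(alr_values)
```

Note that clr keeps all features and its values sum to zero, while alr returns one value fewer than the number of features and depends entirely on which reference you pick.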
In cases where you don't know what the plugin does, it's useful to take a look at the code, since it's open-source software.
Let's see:
ANCOM imports the clr function from the skbio package and later uses it for data normalization.
The transform underlying the ANCOM I test is an ALR. ANCOM takes each pair of species, calculates their log ratio, and applies the statistical test. The W statistic is calculated as the total (or percent) of species that are significantly different after FDR correction at a threshold set a priori.
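A rough sketch of that pairwise counting idea, in plain Python, might look like the following. This is not the plugin's or skbio's code. The function names and the toy table are hypothetical, an exact permutation test on the difference in means is swapped in for whatever test ANCOM is configured with, no FDR correction is applied, and all counts are assumed strictly positive.

```python
import math
from itertools import combinations

def exact_permutation_pvalue(values, n_first):
    """Two-sided exact permutation p-value for a difference in group means.
    The first n_first values are group A, the remainder group B.
    (Stand-in test for illustration only.)"""
    n = len(values)
    total = sum(values)

    def mean_diff(idx_a):
        sum_a = sum(values[i] for i in idx_a)
        return sum_a / n_first - (total - sum_a) / (n - n_first)

    observed = abs(mean_diff(range(n_first)))
    diffs = [abs(mean_diff(c)) for c in combinations(range(n), n_first)]
    # Small tolerance guards against floating-point noise in the comparison.
    extreme = sum(d >= observed - 1e-12 for d in diffs)
    return extreme / len(diffs)

def ancom_w_sketch(table, n_first, alpha=0.05):
    """For every pair of features, test the log ratio between groups and
    count, per feature, how many of its pairwise tests are significant.
    NOTE: no multiple-testing correction is applied in this sketch."""
    n_features = len(table[0])
    w = [0] * n_features
    for i, j in combinations(range(n_features), 2):
        log_ratios = [math.log(row[i] / row[j]) for row in table]
        if exact_permutation_pvalue(log_ratios, n_first) < alpha:
            w[i] += 1
            w[j] += 1
    return w

# Hypothetical counts: rows are samples (4 in group A, then 4 in group B),
# columns are features. Only feature 0 shifts between the groups.
table = [
    [10, 20, 30, 40], [12, 22, 28, 41], [11, 19, 31, 39], [9, 21, 29, 42],
    [50, 21, 30, 40], [55, 20, 29, 41], [48, 22, 31, 39], [52, 19, 28, 42],
]
W = ancom_w_sketch(table, n_first=4)
print(W)
```

In this toy table the feature that shifts between groups collects a significant log ratio against every other feature, so it ends up with the highest W, while the stable features are only flagged in their ratio against it.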
Within QIIME 2, the visualization needs to be calculated on transformed data. So, I think the --p-transform-function dictates the transform applied to calculate the effect size for the volcano plot.