ANCOM-BC formula when using multiple metadata columns

I'm having trouble understanding the --p-formula parameter of qiime composition ancombc.

Without getting into much detail, my metadata file looks like this (all columns are categorical):

sample-id    var1    var2    var3    ...
     S1_1       A       Y       UP
     S1_2       A       Y       UP
     S1_3       A       Y       UP
     S2_1       B       N       DOWN
     S2_2       B       N       DOWN
     S2_3       B       N       DOWN
     S3_1       C       Y       DOWN
     S3_2       C       Y       DOWN
     S3_3       C       Y       DOWN
      ... 

My biological question is how each condition (metadata column) affects diversity (i.e, are there differences? Which condition explains better the differences, if any?). In order to answer those, I ran beta diversity measures (let's suppose I only ran e.g. Bray-Curtis so I simplify the question). Then, for each metadata column I run PERMANOVA. PERMANOVA is significant, say, for columns var1, var2 and var3. Now I want to know which ASVs are responsible for these significant beta diversity diferences. So I run ANCOM-BC for each metadata column, using the name of the column in --p-formula (e.g. --p-formula var1). Until here, I'm okay.

In the ANCOM-BC docstring, a multi formula example with --p-formula 'bodysite + animal' is given. Is this testing differential abundance considering both metadata columns? In my example, when I run ANCOM-BC with e.g. --p-formula var2 and then generate the da-barplot, I obtain a QZV that links to results for var2N (var2 with Y as reference). However, if I run ANCOM-BC with --p-formula "var1 + var2" the QZV contains three links: var1B, var1C and var2N. I supposed using --p-formula "var1 + var2" would result in the same output that running individually var1 and var2. However, the var2N is not the same in the two cases (I obtain much more depleted ASVs with the "complex" formula).

So maybe I unwittingly used a multivariate model? (Sorry if that is completely wrong, I need to learn biostatistics ASAP). So when looking at "var1 + var2", e.g. the var2N abundance plot, that is telling me ASVs differentially abundand in var2::N samples compared with var2::Y samples, but where does the "+ var1" part come in?

Maybe if I'm trying to test multiple metadata columns to check how they explain my data I should just use bioenv instead of keep playing around with formulas?

I'm trying to learn something with GUSTA ME, but I'm not sure if that is exactly what I need here.

Many thanks in advance,

Sergio

Hello again Sergio,

Another banger of a question! I'm not a card-carrying statistician, so check my work.

Your setup makes sense to me. After finding global differences in beta-diversity, you look to DA testing to see what drives these differences.

The q2-composition plugin is downstream. Let's go to the source!

The paper: Analysis of compositions of microbiomes with bias correction - PMC
The tutorial for the R package: ANCOM-BC Tutorial

Yes, both columns are used.
Yes, this is multivariate.
Yes, those who do not learn statistical methods are doomed to (re)invent them.

First, we must learn how the R programming language uses a formula to represent a statistical model.

From the ancom-bc R tutorial,
formula = "age + region + bmi"
age is a continuous variable
region and bmi are categorical factors

Result from the ANCOM-BC log-linear model to determine taxa that are differentially abundant according to the covariate of interest.

Note how in the results, age is shown as its own column while factors are shown only by comparison between levels ( "Overweight - Obese", "Lean - Obese").

You saw this in your data:

So your var1 is a factor with the levels c(B, C, N).
Because it's a factor, you can compare levels against each other or a single 'baseline' level.

Note too that factor of interest is the last item in the formula.

formula = "age + region + bmi"
#      we care most about ^^^ so it goes last

There is a lot to unpack here, so let us know if you have more questions!

2 Likes

Hello @colinbrislawn ,

Thank you so much! I think now I understand it more (at least, more than before asking :sweat_smile:).

I'm trying to translate the dummy example I made up for the post into my actual data, and I think I got it, but some details still remain unclear for me :frowning:

My toy example was (so I save you a bit of scrolling):

Okay, now let's suppose I run ANCOM-BC with the following formula: var1 + var2. var1 has levels c(A, B, C) and var2 has levels c(Y, N) (it's a yes/no variable). I care most about var2 in this case (I'll come back to that later).

If I tabulate ANCOM-BC results to see the LFC table, I'll see a message like the following:

Groups used to define the intercept:
var1::A, var2::Y (and any numerical metadata columns)

So the intercept (i.e., reference) is taking into account both conditions. So:

  • Samples with var1::A --> Sample 1
  • Samples with var2::Y --> Samples 1 and 3
  • Samples taken into account for reference --> the union of the above: S1 and S3

I'm not sure this is actually how it works, but it makes sense to me.

And the other thing I don't know I understood correctly is that there may be a sample both in reference and comparing group. For example, I obtain three LFC columns: var1::B, var1::C and var2::N. Let's take var1::C.

  • Reference samples --> Samples with var1::A and va2::Y (S1 and S3)
  • Samples I'm testing --> Samples with var1::C (S3 again!)

I suppose this makes sense, but it surprised me a bit. So just to make sure that is normal.

And the last thing I didn't really get (sorry! I'm question-bombing you) is:

What is the effect or variable order? In my example, wouldn't I get the same three LFC columns when using var1 + var2 or var2 + var1?

Many thanks for your support.

Sergio

--

I found a couple of biostatistics resources, I hope I understand better these things soon.

That's right.

I think the Qiime2 plugin selects a baseline group to test against. Maybe by what is first in the metadata?
So for var1, A becomes baseline and for var2, Y becomes baseline.

var1::A is missing, because it's the baseline
var2::Y is missing, because it's the baseline

You can also set a baseline explicitly using the plugin. There may be even more options in the R package.

This does not sound right... :thinking:

I think results for all samples are reported, including those in the baseline cohort.


I tried to find official R docs on this and could not. :upside_down_face:

Some tests are invariant to order, so their formulas are too. Easy!
Some packages implement tests in which the order of variables matters, like vegan::adonis().

Try it out for yourself!

1 Like

Exactly! I'm omitting that and just picking the first value in my examples in order to not overcomplicate my question. In my own data I know the reference levels I want to use and I specify them to ANCOM-BC.

I know, but I cannot imagine how that works if my example with var1::C is wrong :frowning:

I tried in my data and results are the same, it only changes the column order (not the values). So we are fine with either formula order.

1 Like