I'm having trouble understanding the --p-formula parameter of qiime composition ancombc.
Without getting into much detail, my metadata file looks like this (all columns are categorical):
sample-id var1 var2 var3 ...
S1_1 A Y UP
S1_2 A Y UP
S1_3 A Y UP
S2_1 B N DOWN
S2_2 B N DOWN
S2_3 B N DOWN
S3_1 C Y DOWN
S3_2 C Y DOWN
S3_3 C Y DOWN
...
My biological question is how each condition (metadata column) affects diversity (i.e., are there differences, and if so, which condition best explains them?). To answer this, I computed beta diversity (let's suppose I only ran Bray-Curtis, to keep the question simple) and then ran PERMANOVA for each metadata column. PERMANOVA is significant for, say, columns var1, var2 and var3. Now I want to know which ASVs are responsible for these significant beta diversity differences, so I run ANCOM-BC for each metadata column, using the name of the column in --p-formula (e.g. --p-formula var1). Up to here, I'm okay.
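To make the per-column runs concrete, here is a minimal sketch of how I generate the commands. The file names (table.qza, metadata.tsv and the output names) are placeholders I made up for this post; the script only prints the commands and does not run QIIME 2.

```python
import shlex


def ancombc_commands(formula, table="table.qza", metadata="metadata.tsv"):
    """Build the ancombc and da-barplot command lines for one formula.

    The file names are hypothetical placeholders, not my real paths.
    """
    # Turn the formula into a tag that is safe to use in a file name.
    tag = formula.replace(" ", "").replace("+", "_")
    differentials = f"ancombc-{tag}.qza"
    ancombc = [
        "qiime", "composition", "ancombc",
        "--i-table", table,
        "--m-metadata-file", metadata,
        "--p-formula", formula,
        "--o-differentials", differentials,
    ]
    barplot = [
        "qiime", "composition", "da-barplot",
        "--i-data", differentials,
        "--o-visualization", f"da-barplot-{tag}.qzv",
    ]
    # shlex.join quotes arguments that contain spaces, such as "var1 + var2".
    return [shlex.join(ancombc), shlex.join(barplot)]


# One ANCOM-BC run plus one barplot per metadata column.
commands = [cmd for column in ("var1", "var2", "var3") for cmd in ancombc_commands(column)]
for cmd in commands:
    print(cmd)
```

Calling ancombc_commands("var1 + var2") produces the same pair of commands with the formula quoted as a single argument.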
In the ANCOM-BC docstring, a multi-term formula example, --p-formula 'bodysite + animal', is given. Is this testing differential abundance considering both metadata columns? In my example, when I run ANCOM-BC with e.g. --p-formula var2 and then generate the da-barplot, I obtain a QZV that links to results for var2N (var2 with Y as reference). However, if I run ANCOM-BC with --p-formula "var1 + var2", the QZV contains three links: var1B, var1C and var2N. I assumed that --p-formula "var1 + var2" would give the same output as running var1 and var2 individually. However, var2N is not the same in the two cases (I obtain many more depleted ASVs with the "complex" formula).
So maybe I unwittingly used a multivariate model? (Sorry if that is completely wrong; I need to learn biostatistics ASAP.) When looking at "var1 + var2", e.g. the var2N abundance plot, I understand it shows the ASVs that are differentially abundant in var2::N samples compared with var2::Y samples, but where does the "+ var1" part come in?
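To show what I think is going on, here is a minimal sketch of how I understand an additive formula to be expanded into 0/1 indicator columns, one per non-reference level. This is my assumption about general R-style formula handling, not something I took from the ANCOM-BC source; the reference levels (A for var1, Y for var2) are set by hand to match the link names I see in the QZV, and only the nine rows pasted above are used.

```python
# Nine rows copied from the metadata excerpt above (the full file has more).
metadata = {
    "var1": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "var2": ["Y", "Y", "Y", "N", "N", "N", "Y", "Y", "Y"],
}

# Reference levels chosen to match the link names in my QZV (assumption).
reference = {"var1": "A", "var2": "Y"}


def dummy_columns(name, values, ref):
    """Return one 0/1 indicator column per non-reference level of a variable."""
    levels = sorted(set(values) - {ref})
    return {f"{name}{level}": [int(v == level) for v in values] for level in levels}


def design_matrix(terms):
    """Expand an additive formula, given as a list of column names, into named columns."""
    n_samples = len(metadata[terms[0]])
    design = {"(Intercept)": [1] * n_samples}
    for term in terms:
        design.update(dummy_columns(term, metadata[term], reference[term]))
    return design


single = design_matrix(["var2"])            # --p-formula var2
additive = design_matrix(["var1", "var2"])  # --p-formula "var1 + var2"

print(list(single))
print(list(additive))
# Check whether two indicator columns carry exactly the same information.
print(additive["var1B"] == additive["var2N"])
```

If this is right, the column names match the three links I get (var1B, var1C and var2N), and the var2N effect in the additive model would be estimated alongside the var1 columns rather than on its own. I also notice that, at least in the nine rows I pasted, the var2N column is identical to var1B, so I'm not sure the two effects can be told apart on that excerpt.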
Maybe, if I'm trying to test how multiple metadata columns explain my data, I should just use bioenv instead of continuing to play around with formulas?
I'm trying to learn something from GUSTA ME, but I'm not sure if that is exactly what I need here.
Many thanks in advance,
Sergio