I have run the ANCOM-BC2 analysis detailed below, but I am hesitant to trust the results.
I find no associations (FDR < 0.05) between any taxa and any of my covariates, which is out of line with my expectations: I strongly believe some taxa are associated with at least some covariates. Using alternative methods (such as mixed models on CLR-transformed relative abundances), I find associations that make sense biologically.
My steps are:
- Multiply the relative abundances by each participant's total read count to recover counts
- Scale continuous covariates and convert batch to a factor
- Create the TSE object, with the count matrix (taxa as rows) as the entry in the named assays list, and col_data containing my metadata with samples as rows
- Run the function, specifying a random intercept for repeated observations and disease as my group variable of interest
Additional information:
- Sample size (rows of df): ~700
- Number of species: ~550
- disease is binary
- group_ID: the data are in pairs or singlets, so there are ~400 unique group_IDs with at most two samples per group
- R version 4.3.3, ANCOMBC version 2.4.0
- No warning messages other than one indicating that my group variable has < 3 levels
Am I going wrong somewhere in my approach?
What diagnostics can I perform to verify things are working as expected?
Any help appreciated!
## Convert relative abundance to counts
df[species] <- df[species] * df$number_of_reads
## Establish meta data
meta_data <- df[c('group_ID', 'batch', 'age', 'BMI', 'disease')]
rownames(meta_data) <- df$sample_ID
# Scale continuous vars and make categorical vars factor
meta_data[, c('age', 'BMI')] <- scale(meta_data[, c('age', 'BMI')])
meta_data$batch <- as.factor(meta_data$batch)
#### Make tse object
# Create the assay (feature table as a matrix, taxa as rows)
assays <- list(counts = as.matrix(t(df[species])))
# Create colData (metadata)
col_data <- DataFrame(meta_data)
# Construct TreeSummarizedExperiment
tse <- TreeSummarizedExperiment(
  assays = assays,
  colData = col_data
)
# Run ANCOM-BC2 on the TSE object
ancom_result <- ancombc2(
  data = tse,
  fix_formula = "batch + age + BMI + disease",
  rand_formula = "(1 | group_ID)",
  p_adj_method = "BH",
  pseudo_sens = TRUE,
  prv_cut = 0, # 10% prevalence filtering already applied upstream
  group = "disease",
  n_cl = 2, # number of cores
  verbose = TRUE,
  iter_control = list(tol = 1e-5, max_iter = 20, verbose = FALSE),
  em_control = list(tol = 1e-5, max_iter = 100),
  lme_control = lme4::lmerControl(check.nobs.vs.nRE = "warning"),
  mdfdr_control = list(fwer_ctrl_method = "BH", B = 1000)
)
Sorry for the delayed response. I had a chance to review your code and didn’t notice any issues.
Just a quick note: ANCOM-BC2 can sometimes be conservative, especially when the sensitivity analysis is enabled. When you say you have no significant results, are those the outcomes after the sensitivity analysis?
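To illustrate the distinction, here is a toy sketch (fabricated numbers, not real output). The column names `q_disease` and `passed_ss_disease` are assumptions based on typical ancombc2 result tables; check `colnames(ancom_result$res)` for your actual names.

```r
# Toy stand-in for ancombc2's result table: a BH-adjusted q-value and a
# pseudo-count sensitivity flag per taxon (column names are assumptions)
res <- data.frame(
  taxon = c("sp1", "sp2", "sp3"),
  q_disease = c(0.01, 0.03, 0.20),
  passed_ss_disease = c(TRUE, FALSE, TRUE)
)
# Significant before the sensitivity filter:
sig_pre <- res$taxon[res$q_disease < 0.05]
# Significant after also requiring the sensitivity check to pass:
sig_post <- res$taxon[res$q_disease < 0.05 & res$passed_ss_disease]
```

Comparing `sig_pre` and `sig_post` on your real output will tell you whether the sensitivity analysis is what removes your hits.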
If so, and you're concerned about statistical power, it’s reasonable to report taxa that were significant before sensitivity analysis, while clearly noting this in your manuscript. We've followed this approach in our own papers to help preserve power and provide context.
Thanks for the update! It sounds like the sensitivity analysis wasn’t the issue after all.
It’s worth noting that ancombc2 can indeed be more conservative, especially when the mixed effects model is used, as we mentioned in our original paper. Just for comparison, could you try running ancombc without including the random intercept and see how the results differ?
Also, I wanted to highlight the structural zeros testing. You may find some significant taxa based on presence/absence patterns, rather than abundance, which could be insightful too.
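To make the idea concrete, here is a toy base-R sketch of what a "structural zero" means. This is not the ANCOM-BC implementation (which applies additional criteria); it is just the presence/absence intuition:

```r
# Toy example: a taxon entirely absent in one disease group but present in
# the other is a candidate structural zero, regardless of its abundance
counts <- matrix(c(5, 8, 0, 0,   # sp1: present only in controls
                   3, 0, 2, 7),  # sp2: present in both groups
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("sp1", "sp2"),
                                 c("ctrl1", "ctrl2", "case1", "case2")))
group <- c("control", "control", "case", "case")
# For each taxon, check per group whether all counts are zero
absent_by_group <- apply(counts, 1, function(x)
  tapply(x, group, function(g) all(g == 0)))
# Candidate structural zero: absent in some group, but not in all groups
struc_zero <- apply(absent_by_group, 2, function(z) any(z) && !all(z))
```

In this toy example, sp1 is flagged and sp2 is not, which is the kind of presence/absence signal the structural zero test picks up.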
Thanks again for taking the time to help me. I really appreciate it.
I wasn't sure whether you were specifically referring to ancombc (i.e., the old function) or whether you meant ancombc2 with the random effects removed.
In any case, I tried both and while q-values were lower, there was still nothing in my variable of interest for either function.
I added struc_zero = TRUE to ancombc2 and ancombc (both without random effects). ancom_result$zero_ind was FALSE in all cases in the former, whereas there were a few cases for which structural zeros passed FDR (0.05) in ancombc (and these hits made biological sense). However, I shouldn't model my data as if they were independent observations since they are related.
Thanks for following up on my last comment. Unfortunately, in this case, it does appear that ancombc2 lacks sufficient power for your study. As I mentioned earlier, we’re aware that ancombc2 can be underpowered in certain scenarios, and we’re actively working on an updated version (tentatively named ancombc3). If you’re planning to publish your data later on, we’d be more than happy to revisit the analysis using your dataset.
On a separate note, while applying mixed models to CLR-transformed relative abundance data is mathematically valid, I’d recommend exercising caution in interpreting the results. The CLR transformation redefines the reference frame to the geometric mean of all taxa, which may affect how results are contextualized. It would be helpful to acknowledge this in the discussion section of your manuscript.
Thanks for your response Huanglin. I'd be curious to see how your update changes things.
I'm also grateful for your comment on CLR. I spent many weeks reading about the different statistical transformations available for compositional data. My conclusion was that there is no "best option" and each has pros and cons.
In the end, I saw no reason to favour one over another and thought CLR had good (enough) theoretical backing and is one of the most commonly used transformations in the field. I don't like having to use a pseudocount, although other methods (such as Bayesian imputation of zeros) reduced my sample size, and I was less fond of that.
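For concreteness, the CLR-with-pseudocount approach I used can be sketched as follows (toy numbers, base R only):

```r
# CLR on one sample's counts, with a pseudocount of 1 to handle zeros
x <- c(10, 0, 90)
clr <- log(x + 1) - mean(log(x + 1))
# By construction, the CLR values of a sample sum to zero, since each value
# is the log-count minus the sample's mean log-count
```

The choice of pseudocount (1 here) is itself a modelling decision; results near the detection limit can be sensitive to it.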
If you can tell me specific details (with references, ideally) about your comment that CLR "may affect how results are contextualized", I would definitely expand on this in the discussion.
I think more work needs to be done on standardising the preprocessing of microbiome data, and researchers should better understand how their statistical pre-treatment influences the results and, thus, the conclusions. This would reduce the heterogeneity of results often seen in the field, IMO.