Filtering ANCOM-BC results: W values and fold-change

I was wondering how we could set a sensible threshold for differential abundance (ANCOM-BC) results.

For example, when performing a differential expression analysis in an RNA-seq experiment, people normally set a p-value or FDR threshold along with a fold-change threshold.

When performing a DA analysis, setting a Q-value threshold seems sensible to me (in fact, qiime composition da-barplot already allows you to do so with --p-significance-threshold). I also wanted to set a fold-change threshold (with some fancy :volcano: plots in mind). Again, da-barplot has an option for that (--p-effect-size-threshold), which makes me think that setting a numerical FC threshold can be reasonable. However, while looking for examples in the literature, I re-read Lin & Peddada's ANCOM-BC paper and found something that I missed on my first reading:

Since there is no hard threshold available for DR to declare whether a taxon is differentially abundant or not, it was not included in this simulation study.

However, Lin & Peddada again, in this review:

Without a hard threshold available for DR, as suggested in the original paper, we investigated the highest/lowest ranks of genera by selecting the top 25 and bottom 25 genera in terms of rank order of regression parameter estimates. [...] While implementing ANCOM, we used the 70th percentile of the distribution of W as the cut-off.

I also read here that it could be a good idea to filter the W values with a fixed threshold (e.g. 0.6, which I understand to mean "keep the top 60% of W values"?). So I have some questions on this:

  1. Is qiime composition ancombc already applying some W cutoff? Maybe the 0.7 stated in Lin & Peddada's review?
  2. Do you find it reasonable to filter based on FC? As in DE experiments, where we do things like "abs(FC) > 2" or "we keep the top X up- and down-regulated genes".
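
To make question 2 concrete, here is a minimal Python sketch of the two filtering styles mentioned there: a q-value plus |LFC| cutoff, and "top N up and down". The record layout, feature IDs, numbers, and default thresholds are all made up for illustration. They are not ANCOM-BC output and not recommended values.

```python
from typing import NamedTuple


class DAResult(NamedTuple):
    feature_id: str  # hypothetical ASV identifier
    lfc: float       # log fold change reported by the DA method
    q_value: float   # FDR-adjusted p-value


def filter_da(results, q_max=0.05, min_abs_lfc=1.0):
    """Keep features that pass both the q-value and the |LFC| threshold."""
    return [r for r in results
            if r.q_value < q_max and abs(r.lfc) >= min_abs_lfc]


def top_n_each_direction(results, n):
    """Return (n most negative LFC, n most positive LFC), strongest first."""
    ranked = sorted(results, key=lambda r: r.lfc)
    return ranked[:n], ranked[::-1][:n]


# Toy values, invented for this example only.
results = [
    DAResult("asv_a", 2.3, 0.001),
    DAResult("asv_b", 0.4, 0.002),
    DAResult("asv_c", -1.7, 0.030),
    DAResult("asv_d", -3.1, 0.200),
]

kept = filter_da(results)
down, up = top_n_each_direction(results, 1)
print([r.feature_id for r in kept])
print([r.feature_id for r in down], [r.feature_id for r in up])
```

Note that the rank-based version ignores significance entirely (here asv_d ranks lowest despite its large q-value), so the two styles can disagree.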

Sorry if the post is a bit chaotic - I'm preparing a talk on my results and I'm trying to figure out the best way of not bringing a lot of DA ASVs to the party.

Cheers,

Sergio

Hi @salias,

So, I don't have tidy answers, but maybe I have something?

The W here refers to the W in ANCOM I/ANCOM II, where W was the portion of pairwise tests that were significant after FDR correction. I think that while they kept the W name, ANCOM-BC calculates the test statistic in a different way. So, with the new W, a cutoff is probably not involved, and an FDR threshold set a priori is probably better.

If I'm doing ranking like that, I personally tend to rely on other methods that use ranks and then construct a single statistic. As far as log fold change goes, one question is whether your coefficients represent a fold change or a log fold change; if it's a log fold change, you'd say coefficients greater than 1.

Personally, my approach in the past has been to filter taxa pretty ruthlessly before I start analyzing my data (a holdover from ANCOM I) and then accept the results I'm seeing.

Best,
Justine

Thanks for the reply!

I'm not familiar with DA/DR methods apart from ANCOM. Do you mean something like Songbird?

Yes, sorry I wasn't clear enough when I wrote the question. I said simply FC, but I was referring to the log2FC values that come in the ANCOM-BC output. So a coefficient > 1 would mean "more than double the abundance".
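
For reference, a small Python sketch of the conversion being discussed. The log base is a parameter here because "coefficient of 1 means double" only holds for base 2. Which base a given ANCOM-BC output uses is an assumption worth checking against the tool's documentation.

```python
import math


def fold_change(lfc, base=2.0):
    """Convert a log fold change back to a plain fold change.

    `base` must match the logarithm used by the tool that produced `lfc`;
    base 2 is assumed by default, which may not match your output.
    """
    return base ** lfc


print(fold_change(1.0, base=2.0))     # 2.0: a doubling if the scale is log2
print(fold_change(-1.0, base=2.0))    # 0.5: a halving if the scale is log2
print(fold_change(1.0, base=math.e))  # about 2.72 if the scale is natural log
```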

I see. I already filter based on taxonomy (removing ASVs annotated as too general / rubbish), but maybe I should also consider filtering based on the number of samples or number of counts.

You have! Thank you again. Let's see what I can do with that.

Hi @salias,

Yep! I tend to use either Songbird or just the coordinates from rPCA and construct an ALR.

Makes sense! I just wanted to verify!

I tend to filter (very much empirically!) based on the rarefaction depth or minimum coverage I'd need to be confident I observed an organism in my shallowest sample, and the number of observations I'd need to be able to run an effective Poisson :fish: regression for the relative risk associated with carrying the bacteria. For 16S, I typically assume I need at least 1 read to establish that an organism is present in my sample, so I set my "presence" threshold at 1/(rarefaction depth). (It's also worth noting that I pick round rarefaction depths to make this easier!) If I'm doing metagenomics, I typically estimate a minimum required coverage to be confident I have the correct organism, typically 10-100x. And then I keep some portion of features present with at least that relative abundance in at least X% of samples. This logic is represented in filter-features-conditionally in the feature-table plugin.
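
The filtering logic above can be sketched in plain Python as follows. This is an illustration only, not the code behind filter-features-conditionally: the function names, the dict-based table layout, the toy counts, and the 50% prevalence value are all assumptions made for the example.

```python
def presence_threshold(rarefaction_depth, min_reads=1):
    """Relative abundance equivalent to `min_reads` reads at a given depth."""
    return min_reads / rarefaction_depth


def filter_features(table, min_rel_abundance, min_prevalence):
    """Keep features reaching `min_rel_abundance` in enough samples.

    table: dict mapping feature ID -> list of per-sample counts, with
    every list in the same sample order. `min_prevalence` is the fraction
    of samples in which the feature must reach the abundance threshold.
    """
    if not table:
        return {}
    n_samples = len(next(iter(table.values())))
    # Per-sample depth, computed from the unfiltered table.
    depths = [sum(counts[i] for counts in table.values())
              for i in range(n_samples)]
    kept = {}
    for feature_id, counts in table.items():
        n_present = sum(
            1 for count, depth in zip(counts, depths)
            if depth > 0 and count / depth >= min_rel_abundance
        )
        if n_present / n_samples >= min_prevalence:
            kept[feature_id] = counts
    return kept


# Toy table rarefied to a (round!) depth of 100 reads per sample.
table = {
    "asv_a": [50, 40, 60, 55],
    "asv_b": [0, 0, 1, 0],   # seen once, in a single sample
    "asv_c": [50, 60, 39, 45],
}
threshold = presence_threshold(100)  # 1 read out of 100
filtered = filter_features(table, threshold, min_prevalence=0.5)
print(sorted(filtered))
```

Here asv_b is dropped because it reaches the presence threshold in only one of four samples, which is the "1 sample does not a distribution make" case.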

In my fecal studies, I typically drop 80-90% of my features using this type of method! But, as I keep telling my students, 1 sample does not a distribution make, and I don't care that the feature pops up only once; it's not interesting until we can do things with it. ANCOM-BC should solve some of these issues for you, but it's worth checking the sample number/filtering criteria.

Best,
Justine

I see! I definitely need to investigate more about all the methods available for this.

It's the first time I read about filter-features-conditionally (or maybe the first time I read it while actually paying attention to what I'm reading :sweat_smile:). I'll look into it!

I hope so, because I have very few samples and also the experimental design is very unbalanced - losing a few samples because of them having 0 features could mean eliminating an entire condition!

Thanks,

Sergio
