I was wondering how we could set a sensible threshold for differential abundance (ANCOM-BC) results.
For example, when performing a differential expression analysis in an RNA-seq experiment, people normally set a p-value or FDR threshold along with a fold-change threshold.
When performing a DA analysis, setting a Q-value threshold seems sensible to me (in fact, qiime composition da-barplot already allows you to do so with --p-significance-threshold). I also wanted to set a fold-change threshold (with some fancy plots in mind). Again, da-barplot has an option for that (--p-effect-size-threshold), which makes me think that setting a numerical FC threshold can be reasonable. However, while looking for examples in the literature, I re-read the Lin & Peddada ANCOM-BC paper and found something that I missed on my first reading:
Since there is no hard threshold available for DR to declare whether a taxon is differentially abundant or not, it was not included in this simulation study.
Without a hard threshold available for DR, as suggested in the original paper, we investigated the highest/lowest ranks of genera by selecting the top 25 and bottom 25 genera in terms of rank order of regression parameter estimates. [...] While implementing ANCOM, we used the 70th percentile of the distribution of W as the cut-off.
I also read here that it could be a good idea to filter the W values with a fixed threshold (e.g. 0.6, which I understand to mean "keep the top 60% of W values"?). So I have some questions on this:
Is qiime composition ancombc already doing some W cutoff? Maybe the .7 stated in Lin & Peddada's review?
Do you find it reasonable to filter based on FC? As in DE experiments, where we do things like "abs(FC) > 2" or "we keep the top X up- and down-regulated genes".
Sorry if the post is a bit chaotic - I'm preparing a talk on my results and I'm trying to figure out the best way of not bringing a lot of DA ASVs to the party.
So, I don't have tidy answers, but maybe I have something?
The W here refers to the W in ANCOM I/ANCOM II, where W was the portion of pairwise tests that were significant after FDR correction. I think that while they kept the W name, ANCOM-BC calculates the test statistic in a different way. So the old W cutoff is probably not involved, and an FDR threshold set a priori is probably better.
If I'm doing ranking like that, I personally tend to rely on other methods that use ranks and then construct a single statistic. As far as log fold change goes, one question is whether your coefficients represent a fold change or a log-fold change; if it's a log-fold change, you'd be looking for coefficients greater than 1.
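To make the older ANCOM-style W concrete, here is a minimal Python sketch of "the portion of pairwise tests that were significant". The boolean matrix is made up, and treating 0.7 as a cutoff on that portion is an assumption about how the 0.6/0.7 figures are meant; this is not what ANCOM-BC computes.

```python
def w_fraction(significant):
    """Per-taxon fraction of significant pairwise tests.

    significant[i][j] is True when the (hypothetical) log-ratio test of
    taxon i against taxon j was significant after FDR correction.
    The diagonal (a taxon against itself) is ignored.
    """
    m = len(significant)
    return [
        sum(row[j] for j in range(m) if j != i) / (m - 1)
        for i, row in enumerate(significant)
    ]

# Made-up results for 4 taxa.
sig = [
    [False, True, True, True],
    [True, False, False, False],
    [True, False, False, True],
    [True, False, True, False],
]

fractions = w_fraction(sig)
cutoff = 0.7  # assumed to be a cutoff on the fraction of significant tests
detected = [i for i, f in enumerate(fractions) if f >= cutoff]
print(detected)  # [0]
```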
Personally, my approach in the past has been to filter taxa pretty ruthlessly before I start analyzing my data (a holdover from ANCOM I) and then accept the results I'm seeing.
I'm not familiar with DA/DR methods apart from ANCOM. Do you mean something like Songbird?
Yes, sorry I wasn't clear enough when I wrote the question. I said simply FC but I was referring to the log2FC that comes in the ANCOM-BC output. So coefficient > 1 would mean "double the abundance".
I see. I already filter based on taxonomy (removing ASVs annotated as too general / rubbish), but maybe I should also consider filtering based on number of samples or number of counts.
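For illustration, a combined Q-value and fold-change filter could look like the sketch below. The rows and the column names "lfc" and "q_val" are hypothetical, and the threshold of 1 assumes the coefficients really are on a log2 scale (worth checking: on a natural-log scale a coefficient of 1 would be roughly a 2.7-fold change, not a doubling).

```python
# Hypothetical rows mimicking a differential-abundance table; the column
# names "lfc" and "q_val" are assumptions, not the exact ANCOM-BC export.
results = [
    {"id": "ASV_1", "lfc": 2.3, "q_val": 0.001},
    {"id": "ASV_2", "lfc": 0.4, "q_val": 0.010},
    {"id": "ASV_3", "lfc": -1.7, "q_val": 0.030},
    {"id": "ASV_4", "lfc": 3.1, "q_val": 0.200},
]

def filter_da(rows, q_max=0.05, min_abs_lfc=1.0):
    """Keep rows that pass both the q-value and the |log fold-change| threshold."""
    return [
        r for r in rows
        if r["q_val"] < q_max and abs(r["lfc"]) > min_abs_lfc
    ]

kept = filter_da(results)
print([r["id"] for r in kept])  # ['ASV_1', 'ASV_3']
```

ASV_2 is significant but has a small effect, and ASV_4 has a large effect but fails the Q-value cutoff, so both are dropped.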
You have! Thank you again. Let's see what I can do with that.
Yep! I tend to use either Songbird or just the coordinates from rPCA and construct an ALR.
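A rough sketch of turning feature ranks into a single per-sample statistic, assuming the idea is a log-ratio of the top-ranked features over the bottom-ranked ones. The rank values, counts, and the function name are made up for illustration; this is not Songbird's or rPCA's own code.

```python
import math

def rank_log_ratio(ranks, counts, k=2):
    """Log-ratio of summed counts: top-k ranked features over bottom-k.

    ranks:  {feature_id: rank coefficient} (hypothetical values)
    counts: {feature_id: count} for one sample
    Returns None when either group sums to zero (log-ratio undefined).
    """
    ordered = sorted(ranks, key=ranks.get)  # ascending by coefficient
    bottom, top = ordered[:k], ordered[-k:]
    numerator = sum(counts.get(f, 0) for f in top)
    denominator = sum(counts.get(f, 0) for f in bottom)
    if numerator == 0 or denominator == 0:
        return None
    return math.log(numerator / denominator)

ranks = {"A": 1.5, "B": 0.9, "C": -0.2, "D": -1.1}
sample = {"A": 30, "B": 10, "C": 5, "D": 5}
print(rank_log_ratio(ranks, sample))  # log(40 / 10) = log(4)
```

Samples where either group has zero counts come back as None, which is one reason aggressive feature filtering and the choice of k matter for this kind of statistic.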
Makes sense! I just wanted to verify!
I tend to filter (very much empirically!) based on the rarefaction depth or minimum coverage I'd need to be confident I observed an organism in my shallowest sample, and the number of observations I'd need to be able to run an effective Poisson regression for the relative risk associated with carrying the bacteria. For 16S, I typically assume I need at least 1 read to establish that an organism is present in my sample, so I set my "presence" threshold at 1/(rarefaction depth). (It's also worth noting that I pick round rarefaction depths to make this easier!) If I'm doing metagenomics, I typically estimate a minimum required coverage to be confident I have the correct organism, typically 10-100x. And then I keep some portion of features present with at least that relative abundance in at least X% of samples. This logic is represented in filter-features-conditionally in the feature-table plugin.
In my fecal studies, I typically drop 80-90% of my features using this type of method! But, as I keep telling my students, 1 sample does not a distribution make, and I don't care that the feature pops up only once; it's not interesting until we can do things with it. ANCOM-BC should solve some of these issues for you, but it's worth checking the sample number/filtering criteria.
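The abundance-plus-prevalence logic above can be sketched in plain Python as follows. This is only an illustration of the idea with a made-up table and a hypothetical function name, not the plugin's implementation.

```python
def filter_features(table, min_rel_abundance, min_prevalence):
    """Return the features to keep.

    table: {sample_id: {feature_id: count}}
    A feature is kept when its relative abundance is at least
    min_rel_abundance in at least min_prevalence (a fraction) of samples.
    """
    n_samples = len(table)
    hits = {}
    for counts in table.values():
        total = sum(counts.values())
        if total == 0:
            continue  # an empty sample cannot support any feature
        for feature, count in counts.items():
            if count / total >= min_rel_abundance:
                hits[feature] = hits.get(feature, 0) + 1
    return {f for f, n in hits.items() if n / n_samples >= min_prevalence}

rarefaction_depth = 1000          # a round depth, as suggested above
presence = 1 / rarefaction_depth  # "at least 1 read" as a relative abundance

# Made-up table: 4 samples rarefied to 1000 reads each.
table = {
    "s1": {"A": 900, "B": 100, "C": 0},
    "s2": {"A": 500, "B": 499, "C": 1},
    "s3": {"A": 1000, "B": 0, "C": 0},
    "s4": {"A": 700, "B": 300, "C": 0},
}

kept = filter_features(table, presence, min_prevalence=0.5)
print(sorted(kept))  # ['A', 'B']
```

Feature C is present in only one of four samples, so it falls below the 50% prevalence requirement and is dropped.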
I see! I definitely need to investigate more about all the methods available for this.
It's the first time I read about filter-features-conditionally (or maybe the first time I read it while actually paying attention to what I'm reading). I'll look into it!
I hope so, because I have very few samples and also the experimental design is very unbalanced - losing a few samples because of them having 0 features could mean eliminating an entire condition!