Is it necessary to filter data before applying ANCOM?

Dear all,
I would like to know whether it is necessary to filter the data (e.g. remove low-abundance features or taxa) before the analysis. What would happen if I did not filter out the low-abundance sequences? Can I still use the result?
Another puzzle about ANCOM is:
How should I describe ANCOM?
I have seen a description like this:
“ANCOM calculates pairwise log-ratios between combinations of taxa and considers how many times (W) the null hypothesis (no difference between each pairwise comparisons of taxa) is violated”

But as I understand it, W is the number of times the sub-hypothesis is rejected, where the sub-hypothesis is “no difference between each pairwise comparison of taxa”, and the null hypothesis is W = some cutoff threshold? I'm not sure. How can I describe the ANCOM test statistic in one sentence? I have not found a satisfactory description in the article so far. Could anyone provide a reference?

Thank you for your attention.

Hi @YuZhang :wave:

It depends. This is a decision you will need to make based on your familiarity with your data. I will quote the Parkinson’s Mice tutorial:

Filtering can provide better resolution and limit false discovery rate (FDR) penalty on features that are too far below the noise threshold to be applicable to a statistical test. A feature that shows up with 10 counts could be a real feature that is present only in that sample; a feature that’s present in several samples but only got amplified and sequenced in one sample because PCR is a somewhat stochastic process; or it may be noise. It’s not possible to tell, so feature-based analysis may be better after filtering low abundance features. However, filtering also shifts the composition of a sample, further disrupting the relationship. Here, the filtering is performed as a trade-off between the model, computational efficiency,

This is a case where a decision was made about filtering after considering the pros and cons.
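
To make the trade-off concrete, here is a minimal sketch of what a low-abundance filter does to a feature table. It is plain Python, not the QIIME 2 implementation; the function name `filter_low_abundance`, the thresholds `min_total` and `min_samples`, and the toy counts are all made up for illustration.

```python
def filter_low_abundance(table, min_total=10, min_samples=2):
    """Keep features that are abundant enough and seen in enough samples.

    table: dict mapping a feature ID to its list of per-sample counts.
    min_total and min_samples are illustrative thresholds, not recommendations.
    """
    kept = {}
    for feature_id, counts in table.items():
        total = sum(counts)                          # total count across samples
        n_present = sum(1 for c in counts if c > 0)  # samples with the feature
        if total >= min_total and n_present >= min_samples:
            kept[feature_id] = counts
    return kept


# Toy feature table (hypothetical counts, four samples).
table = {
    "ASV_1": [120, 98, 143, 110],  # abundant everywhere
    "ASV_2": [0, 10, 0, 0],        # 10 counts, but in a single sample
    "ASV_3": [3, 0, 2, 1],         # present in several samples, very low total
    "ASV_4": [40, 0, 55, 62],      # abundant in three of four samples
}

filtered = filter_low_abundance(table)
print(sorted(filtered))  # ['ASV_1', 'ASV_4']
```

Note that `ASV_2` is exactly the ambiguous "10 counts in one sample" case from the quote: the filter cannot tell whether it is real or noise, it just removes it, which is why the thresholds are a judgment call.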

It’s a bit different in the Moving Pictures tutorial example:

ANCOM assumes that few (less than about 25%) of the features are changing between groups. If you expect that more features are changing between your groups, you should not use ANCOM as it will be more error-prone (an increase in both Type I and Type II errors is possible). Because we expect a lot of features to change in abundance across body sites, in this tutorial we’ll filter our full feature table to only contain gut samples.

In this case, the decision to filter based on body-site is perhaps a bit more obvious/necessary.

Have you read the ANCOM paper?


Thanks. I have already read the article. However, I am not good at statistics and can only understand the general idea of the article.

This is my understanding, but I don't know whether it is right or not. How can I describe this accurately?

The W score indicates the number of comparisons that are deemed significant. In this case, the W score indicates how many times a feature was found to be differentially abundant. Differential abundance is implicated for a feature when there is significant variance in the pairwise comparisons of log-ratios with other features.
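
As a rough illustration of the counting idea behind W, here is a minimal sketch in plain Python. It is not the ANCOM implementation: an exact permutation test on the difference in group means stands in for whatever test a real implementation applies, there is no multiple-testing correction, and no cutoff on W is derived. The names `permutation_p_value` and `ancom_like_w`, the pseudocount of 1, and the toy counts are all assumptions made for this example.

```python
import math
from itertools import combinations


def permutation_p_value(values, group_a_idx):
    """Exact two-sided permutation p-value for a difference in group means."""
    n, k = len(values), len(group_a_idx)
    total = sum(values)

    def abs_mean_diff(idx):
        sum_a = sum(values[i] for i in idx)
        return abs(sum_a / k - (total - sum_a) / (n - k))

    observed = abs_mean_diff(group_a_idx)
    n_extreme = n_splits = 0
    for idx in combinations(range(n), k):  # every possible group assignment
        n_splits += 1
        if abs_mean_diff(idx) >= observed - 1e-12:  # tolerance for float ties
            n_extreme += 1
    return n_extreme / n_splits


def ancom_like_w(table, group_a_idx, alpha=0.05, pseudocount=1):
    """For each feature, count how many pairwise log-ratio tests are rejected."""
    w = {feature: 0 for feature in table}
    for f, g in combinations(table, 2):
        # Log-ratio of feature f to feature g in every sample.
        log_ratios = [
            math.log((a + pseudocount) / (b + pseudocount))
            for a, b in zip(table[f], table[g])
        ]
        if permutation_p_value(log_ratios, group_a_idx) < alpha:
            w[f] += 1  # a rejected sub-hypothesis counts for both features
            w[g] += 1
    return w


# Toy data: samples 0-3 are group A, samples 4-7 are group B.
# Only F1 changes between groups; F2-F4 are roughly stable.
table = {
    "F1": [10, 12, 9, 11, 100, 120, 95, 110],
    "F2": [50, 55, 48, 52, 51, 49, 53, 50],
    "F3": [30, 28, 33, 31, 29, 32, 30, 31],
    "F4": [20, 22, 19, 21, 21, 20, 22, 19],
}

w = ancom_like_w(table, group_a_idx=(0, 1, 2, 3))
print(w)
```

In this toy table the log-ratios involving `F1` differ between the groups in every pairing, so `F1` collects the highest W, while the stable features are only flagged in their pairing with `F1`. That is the sense in which a high W points at the feature that is changing relative to most of the others.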

I would recommend making sure you have a solid understanding of ANOVA, then re-reading the ANCOM paper while trying to understand how it differs from ANOVA. There are some videos on YouTube and a good Wikipedia article about compositional data.

I hope that helps! :tophat: