Identify differentially represented ASV using ANCOM and Random Forest

Hi, I tried to identify ASVs that are differentially represented between two microbiomes (10 samples in each category). The two groups separate really well along PCoA1 (45%), but ANCOM returned only 6 ASVs, and they are not among the top 10 most important features from the Random Forest prediction. Any ideas or suggestions?

Hi @Meng_Wu,

First, which metric showed separation? Different metrics tell you different things about your data and might need to be evaluated differently, so it's definitely something to consider!

Second, is your separation significant with PERMANOVA? (It should be, but is it?)

Third, two groups of 10 samples each is a relatively small dataset for either random forest or ANCOM to pick up a signal. I slightly worry about overfitting your RF model (although that’s not really my area of expertise, and @Nicholas_Bokulich is probably a better person to ask). But I think the two approaches are, again, asking different things of your data.

I guess one question is whether you’ve tried using the two groups in PCoA space as a predictor in your model, or are you using some metadata category that correlates?

Best,
Justine

Hi @jwdebelius,

Thank you so much for the quick reply. It showed separation in both unweighted UniFrac (PCoA1, 45% of variation) and weighted UniFrac (PCoA2, 17%). Yes, it’s significant with PERMANOVA. I know 10 samples per group (20 total) is not much for random forest; however, @Nicholas_Bokulich, do you think it might be good enough for a categorical analysis? I totally understand that the two approaches are different, but I am wondering what your opinion on this is. I know ANCOM is much more conservative, but I would assume the ASVs identified by ANCOM would show up among the top ones in Random Forest?

I used the metadata category that correlates with PCoA1 as the predictor.

Best,

Meng

Not necessarily. The world works in mysterious ways.

Yeah, that's a really low sample size. I would discourage putting too much trust in the results, and I'd suggest using the classify-samples-ncv method so that you can see the variance in performance across multiple folds (this info is printed to stdout, so use --verbose to view it).
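
To illustrate why the spread across folds matters with 20 samples, here is a minimal sketch in plain Python. It is not what classify-samples-ncv does internally: it uses ordinary stratified k-fold (no nested tuning loop), a toy nearest-centroid classifier instead of a random forest, and made-up data. All function names and values are assumptions for illustration only.

```python
import random
import statistics

def stratified_folds(labels, k, rng):
    """Split sample indices into k folds, keeping the classes balanced."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def nearest_centroid_predict(train_x, train_y, x):
    """Toy classifier: assign x to the class with the closest mean profile."""
    best_cls, best_dist = None, float("inf")
    for cls in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == cls]
        centroid = [statistics.fmean(col) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_cls, best_dist = cls, dist
    return best_cls

def fold_accuracies(x, y, k=5, seed=0):
    """Return the held-out accuracy of each of the k folds."""
    rng = random.Random(seed)
    accs = []
    for test_idx in stratified_folds(y, k, rng):
        held_out = set(test_idx)
        train_x = [x[i] for i in range(len(y)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        hits = sum(
            nearest_centroid_predict(train_x, train_y, x[i]) == y[i]
            for i in test_idx
        )
        accs.append(hits / len(test_idx))
    return accs

# Made-up data: 20 samples (10 per group), 5 noisy features, weak group shift.
data_rng = random.Random(42)
labels = ["A"] * 10 + ["B"] * 10
data = [
    [data_rng.gauss(0.5 if lab == "B" else 0.0, 1.0) for _ in range(5)]
    for lab in labels
]

accs = fold_accuracies(data, labels, k=5, seed=1)
print("per-fold accuracy:", accs)
print("mean:", statistics.fmean(accs), "stdev:", statistics.stdev(accs))
```

With 5 folds there are only 4 held-out samples per fold, so each fold's accuracy can only be 0, 0.25, 0.5, 0.75, or 1. A single misclassified sample moves a fold by 25 points, which is one reason to look at the per-fold spread rather than a single overall number.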

If your stronger signal is unweighted UniFrac, it may be a presence/absence thing. Is your alpha diversity also significantly reduced in one group? If that's the case, you might be looking at something fun and semi-stochastic, which might or might not be picked up nicely by either feature-based method.
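
As a rough illustration of how a presence/absence metric and an abundance-weighted metric can disagree, here is a small sketch. It swaps in Jaccard and Bray-Curtis for unweighted and weighted UniFrac (no tree needed), and the counts are made up, so treat it as an analogy rather than the actual UniFrac calculation.

```python
def jaccard_distance(u, v):
    """Presence/absence distance: 1 - (shared features / features in either sample)."""
    present_u = {i for i, c in enumerate(u) if c > 0}
    present_v = {i for i, c in enumerate(v) if c > 0}
    union = present_u | present_v
    if not union:
        return 0.0
    return 1.0 - len(present_u & present_v) / len(union)

def bray_curtis_distance(u, v):
    """Abundance-weighted distance: sum of count differences over total counts."""
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / total

# Made-up counts: both samples share two dominant ASVs,
# but sample_low has lost the four rare ASVs.
sample_high = [500, 480, 5, 5, 5, 5]
sample_low = [510, 490, 0, 0, 0, 0]

print("Jaccard:    ", jaccard_distance(sample_high, sample_low))
print("Bray-Curtis:", bray_curtis_distance(sample_high, sample_low))
```

Here the presence/absence distance is large (4 of 6 features differ) while the abundance-weighted distance is tiny (40 of 2000 reads differ), which is the kind of pattern where losing rare features drives the separation.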

Best,
Justine

Hi, thank you so much for the response. @jwdebelius, yes, the alpha diversity is also significantly reduced in one group. I would have thought that presence/absence is the strongest signal and should be easy for ANCOM to pick up. Could you please elaborate a little more on why it might or might not be picked up nicely by the feature-based methods, and what better method I should use?

Thank you so much!

Hi @Meng_Wu,

So, if it’s a low-diversity problem, it may be stochastic like I said: you’re seeing a loss of random features. Especially for a dataset this small, finding a clear trend in which features are lost can be hard. You could try something like a logistic/Poisson regression to predict the presence/absence of an ASV/OTU, but then you have to decide how to define “present”… which isn’t always easy. You may struggle to find features that clearly “define” the state. …Sorry I don't have a better idea.
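
To show how much the definition of “present” can matter, here is a minimal sketch for a single ASV. Instead of a logistic/Poisson regression it swaps in a simpler stand-in, a two-sided Fisher's exact test on the 2x2 presence table, and the counts and thresholds are made up for illustration.

```python
from math import comb

def presence_table(counts_a, counts_b, min_count):
    """2x2 table [[present_A, absent_A], [present_B, absent_B]] for one ASV."""
    pa = sum(c >= min_count for c in counts_a)
    pb = sum(c >= min_count for c in counts_b)
    return [[pa, len(counts_a) - pa], [pb, len(counts_b) - pb]]

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test: sum the probabilities of all tables
    (with the same margins) that are no more likely than the observed one."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x "present" samples in group A.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)

# Made-up read counts for one ASV in 10 samples per group.
group_a = [12, 8, 0, 15, 3, 9, 1, 20, 7, 2]
group_b = [1, 0, 0, 2, 0, 1, 0, 3, 0, 0]

for min_count in (1, 3, 10):
    table = presence_table(group_a, group_b, min_count)
    print(f"present if count >= {min_count}: table={table}, "
          f"p={fisher_exact_two_sided(table):.4f}")
```

The same ASV gives a different table, and so a different p-value, at each threshold, and with 10 samples per group the test has little power to begin with. Any real analysis across many ASVs would also need multiple-testing correction, which this sketch leaves out.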

Best,
Justine