Differential abundance with ANCOM-BC: is subsetting to a specific genus "legal" or just p-hacking?

I’m facing a bit of a statistical crossroads. We are studying fungal communities with ITS2 and, since we are interested particularly in Fusarium, we also used an additional amplicon that resolves better at the species level for that genus (TEF1).

The primers for TEF1 were specifically optimized to capture all Fusarium species. Of course, they do amplify other fungal taxa as "off-targets," but we haven't validated if they represent the rest of the fungal community reliably (I believe that is not the case).

We already performed ANCOM-BC at the ASV level on ITS sequences, but my supervisor asked me if it would make sense to subset the TEF1 table to only include Fusarium ASVs and then run the differential abundance (DA) analysis on that specific subtable.

My question/concern is, wouldn't this break some assumptions of ANCOM-BC? Like, for example, assuming that most taxa do not change. Do you think it makes sense methodologically to test differential abundance on a subset of ASVs? Would you keep the (potentially unreliable) off-target fungi in the table for the sake of the model's math?

Thank you in advance, and also sorry for the inactivity in the forum - there are a lot of things going on currently (preparing papers, stays, the PhD itself)

Best,

Sergio


Hi @salias, from my perspective this would be a valid approach. Removing off-target taxa from your TEF1 feature table is analogous to quality control, similar to, say, removing mitochondria or chloroplast ASVs from a 16S feature table before downstream analysis. By using the primers to target Fusarium specifically, you've chosen to look at that genus a priori, so this filter is helping to get you there.

This would be different than, for example, running ANCOMBC, seeing some features fall just above a q-value threshold (say, q slightly greater than 0.05), then filtering out other features to have fewer comparisons, and re-running ANCOMBC to achieve a q less than 0.05.

I don't think this is an issue statistically for ANCOMBC, but that's where I'm a little less confident. Someone else should please feel free to :qiime2: in on that aspect.
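For reference, here is a minimal sketch of the kind of a priori taxonomy filter discussed above, written in plain Python. The ASV IDs, taxonomy strings, and counts are made up for illustration, and the `g__` rank prefix is an assumption about how the TEF1 taxonomy is formatted. It also reports how many reads each sample keeps, which is worth documenting alongside the filter.

```python
# Hypothetical toy data: ASV IDs, taxonomy strings, and counts are invented.
taxonomy = {
    "asv1": "k__Fungi;p__Ascomycota;g__Fusarium;s__Fusarium_oxysporum",
    "asv2": "k__Fungi;p__Ascomycota;g__Fusarium;s__Fusarium_graminearum",
    "asv3": "k__Fungi;p__Ascomycota;g__Trichoderma;s__Trichoderma_harzianum",
    "asv4": "k__Fungi;p__Basidiomycota;g__Rhizoctonia;s__unidentified",
}

# Counts per ASV across three samples (same sample order in every row).
table = {
    "asv1": [120, 30, 0],
    "asv2": [15, 60, 45],
    "asv3": [300, 10, 5],
    "asv4": [65, 0, 50],
}


def genus_of(taxon):
    """Return the genus name from a ';'-separated taxonomy string, or None."""
    for rank in taxon.split(";"):
        rank = rank.strip()
        if rank.startswith("g__"):
            return rank[3:]
    return None


def filter_to_genus(table, taxonomy, genus):
    """Keep only the ASVs whose taxonomy assigns them to the given genus."""
    return {
        asv: counts
        for asv, counts in table.items()
        if genus_of(taxonomy.get(asv, "")) == genus
    }


# The filter is defined by the study question, before any test is run.
fusarium_table = filter_to_genus(table, taxonomy, "Fusarium")

# Per-sample read totals before and after filtering.
total = [sum(col) for col in zip(*table.values())]
retained = [sum(col) for col in zip(*fusarium_table.values())]

for i, (kept, full) in enumerate(zip(retained, total)):
    print(f"sample {i}: kept {kept} of {full} reads")
```

In QIIME 2 itself, this kind of taxonomy-based filtering is normally done with `qiime taxa filter-table` and its `--p-include` option rather than by hand.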


Hello @gregcaporaso

That makes total sense. I hadn't thought about it as a QC step similar to e.g. removing mitochondria or unassigned ASVs. I'll move forward with the filtered table and just document it clearly as a targeted analysis (and not a post hoc attempt to get smaller p-values).

Thank you so much for the help! I'll also wait just in case anyone familiar with ANCOM-BC has something to add.

Best,

Sergio


Sorry to pop in, I hope it's okay that I show up late @gregcaporaso and @salias,

I'll go a step further on the "it's okay to run ANCOM-BC on a subset of your data". I think if you walk in with an a priori hypothesis, it's okay to just test those taxa (and, if there are a lot of them, apply an appropriate FDR correction).
So, if you're looking at Fusarium species specifically because you have a specific hypothesis,* it's valid to just test those species based on an appropriate distribution and transform of your data (CLR, prevalence, ALR, something). I think it's valid to test taxa that you have an a priori hypothesis about even if you don't see a community-level difference. You've already got some basis of evidence, so it's more confirmatory compared to an untargeted exploratory analysis.
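To make the two ingredients above concrete, here is a rough sketch in plain Python of a CLR transform and a Benjamini-Hochberg FDR adjustment. The counts, the pseudocount of 0.5, and the p-values are all made-up assumptions for illustration; this is not how ANCOM-BC works internally, just one way to handle a small set of pre-specified taxa.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's counts.

    The pseudocount (an arbitrary choice here) avoids taking log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical counts for four Fusarium ASVs in one sample.
sample_counts = [120, 15, 0, 45]
print(clr(sample_counts))

# Hypothetical per-taxon p-values from whatever test was chosen a priori.
pvalues = [0.01, 0.04, 0.03, 0.20]
print(benjamini_hochberg(pvalues))
```

One thing to keep in mind: CLR values are relative to the geometric mean of whatever features are in the table, so CLR values computed on the Fusarium-only table are not the same as those computed on the full table.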

Anyway, my two cents :coin:.

Best,
Justine


*Support might be literature, inferred function, esoteric knowledge, or a firmly held belief. I had a collaborator tell me that I should find Baltic halophiles in fecal samples after eating a small amount of fermented fish that cannot be legally transported on planes due to the fear the smelly fish will explode.

Hi @jwdebelius

Thank you for sharing your thoughts! Since ANCOM-BC handles the log transformation and bias correction internally, it sounds like running it on the Fusarium subset is the way to go.

Really appreciate the help from everyone here!

Best,

Sergio
