Differential abundance (DA) analysis before or after feature table collapse to a specific taxonomical level


I'm facing a bit of a dilemma: should I run DA analysis before or after collapsing the feature table to a specific taxonomic level?
As I understand it, the whole point of an ASV-level feature table is to keep the data at higher resolution rather than grouping features by whatever characteristics they share. Once we group them to a taxonomic level, aren't we essentially taking an OTU-like approach and sacrificing resolution?
On the other hand, if a feature is differentially abundant, one needs to know what exactly that feature is in order to interpret it biologically. So wouldn't it be better to run DA on the un-collapsed feature table and then do the taxonomic annotation? I usually see the opposite: people and tutorials first collapse/aggregate to a specific level and then proceed to DA. But wouldn't that change your feature table with unknown downstream consequences?
I think the core of my question is whether or not to collapse the feature table before basically any statistical analysis. How could one assess how the collapse affects the end results? Should one run both approaches and compare them at the end? What if they conflict?

In my opinion, both approaches have their own strengths and weaknesses. You don't have to choose between resolution and grouping, so it may be a good idea to run the analysis at both levels. Moreover, comparing the two approaches can itself lead to new conclusions.


Hi @Parix,

The short answer is that it depends on your system. I’m going to link you to a recent paper of mine where our results were driven by a pair of features that were only detectable with ASVs. We did genus-level testing at the request of a reviewer and none of the genera were differentially abundant because the results centered around shifts in organisms from like four genera.

On the other hand, I’ve had analyses where everything from the same genus behaved the same way, and our signals got lost at the ASV level because of it. That was a case where clustered or genus-level data might have served us better.

So, I think @timanix is exactly right:

There are a few caveats here. First, phylum-level resolution is not a good way to describe your data. It’s comparing things with spinal cords to things with chitinous shells, and certainly most of the biological assumptions that taxonomy is supposed to be shorthand for don’t hold at that level. Second, while individual features often behave in concert, high-level signals are built on those low-level trends, so a difference at the family level may be due to just three ASVs that are changing. Third, taxonomy is nested, so if you’re prefiltering, filter together and then apply the nested methods.
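
To make the point about low-level trends and nesting a bit more concrete, here is a minimal pandas sketch. Everything in it is made up for illustration: the counts, the sample names, the `GenusA`/`GenusB`/`GenusC` labels, and the `collapse` and `group_means` helpers are all hypothetical, and comparing group means is only a stand-in for a real DA method. It just shows how summing ASVs within a genus can hide two ASVs that move in opposite directions, while a single shifting ASV still shows up after the collapse.

```python
import pandas as pd

# Toy ASV-by-sample count table (hypothetical numbers, illustration only).
counts = pd.DataFrame(
    {
        "ctrl_1": [100, 20, 50, 30],
        "ctrl_2": [110, 10, 50, 30],
        "trt_1": [20, 100, 50, 60],
        "trt_2": [10, 110, 50, 60],
    },
    index=["ASV1", "ASV2", "ASV3", "ASV4"],
)

# Hypothetical genus assignment for each ASV.
genus = pd.Series(
    {"ASV1": "GenusA", "ASV2": "GenusA", "ASV3": "GenusB", "ASV4": "GenusC"},
    name="genus",
)


def collapse(table, labels):
    """Sum the counts of all features that share a taxonomic label."""
    return table.groupby(labels.reindex(table.index)).sum()


def group_means(table):
    """Mean count per group and the treatment-minus-control difference.

    This is NOT a differential abundance test, just a crude summary.
    """
    ctrl = table.filter(like="ctrl_").mean(axis=1)
    trt = table.filter(like="trt_").mean(axis=1)
    return pd.DataFrame({"ctrl": ctrl, "trt": trt, "diff": trt - ctrl})


genus_table = collapse(counts, genus)
asv_summary = group_means(counts)
genus_summary = group_means(genus_table)

# Attach the genus label so each genus-level difference can be traced
# back to the ASVs nested inside it.
traced = asv_summary.assign(genus=genus)

print(genus_summary)
print(traced)
```

In this toy table, ASV1 and ASV2 shift strongly in opposite directions, so GenusA looks unchanged after the collapse, whereas the GenusC difference comes entirely from ASV4. Real data will be messier, and the same comparison would need a proper compositional DA method at each level.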