Is it best to work on abundance tables of ASVs or taxa?

SunScript0 · May 6, 2022, 10:55am

Hello,

I've seen several 16S workflows that do most of their analysis on ASV/OTU abundance tables and only look into which taxa correspond to which ASVs later down the line. Meanwhile, in the analysis I've done thus far, I first generated genus-level abundance tables and then did my analysis on those (alpha and beta diversity, differential abundance, etc.). This was not a problem until I started needing phylogenetic distance metrics: I could not find a way to make a suitable tree from the taxonomic hierarchy. So here are my questions:

Is there a reason to do most analysis on ASV/OTU tables and only map to taxa later? After all, an ASV representative sequence can only be interpreted once it is "assigned" to a taxon, no? To be clear, I am specifically talking about 16S data from gut microbiomes, so the databases should be quite complete.
Is there a way to work with phylogenetic distances (UniFrac) using a phylogenetic tree constructed from the taxonomic hierarchy instead of from ASV representative sequence similarity?

jwdebelius · May 6, 2022, 8:28pm

Hi @SunScript0,

Welcome to the QIIME 2 forum!

This is going to be a long answer. I just poured myself a new cup of tea (and encourage you to join me in a beverage), because it's a big conversation.

In 16S analysis, we essentially operate on the assumption that a specific molecular fingerprint reflects a set of functions and interactions in an ecosystem with phenotypic consequences for the overall organism. With 16S, we're making this assumption based on a phylogenetically identified fingerprint from a universal tree of life, and we're essentially saying "closer evolution, closer function/genome." With a taxonomic assignment, you're saying "same genus, similar function".

I think the other assumption is that by naming things we give them meaning. It's a rainy afternoon, I have my aforementioned tea, and I think this is maybe more of a philosophical question than it needs to be. I'd argue that community measurements in and of themselves, without any taxonomic naming, can be informative. Naming certainly can help to contextualize behavior (that whole "same genus, similar function" thing often works), but I can ask all sorts of questions about my data without ever annotating the sequences.

I also think we need to talk about some of the challenges with annotation, even in relatively well-characterized environments. There are some "everyone knows, no one talks about" issues in microbiome taxonomy to think about when you interpret an assignment.

First, taxonomy in and of itself is incredibly messy, even in macroscopic organisms. A phenotype-based characterization is great... right up until you start discovering things like the fact that dinosaurs might have had feathers and, therefore, chickens are more closely related to dinosaurs than reptiles are, despite what you might have been told in school. (Also, this seriously puts the Emu War into a slightly scarier context.)
Using phenotypic morphology as a guide can sometimes be super useful, and sometimes you can miss key traits or end up with weird things because of convergent evolution.

And this is a problem before we get into a set of organisms that are really hard to grow in captivity and which don't do nice sexual or asexual reproduction. So, what are the issues in bacteria, specifically?

Our databases are trying to capture uncultured/uncharacterized life. The gut is relatively well mapped, but that doesn't necessarily mean you'll get clean names for everything, just that the sequences will show up in the database. It wouldn't be so bad if the names behaved well.
Bacterial phylogeny/taxonomy is even messier because our names are often based on our ability to culture organisms. So, whatever we cultured and thought was morphologically similar picked up a name. Then that name got transferred to closely related things. And now we have a big jumble of things that are similar... ish. Because of that, it erm... turns out that some of the most common phyla in the human body are polyphyletic.
Maybe it's not surprising, given these issues, that the most commonly used databases aren't really comparable. (This doesn't mean "go make your own DB for your analysis", it means "be aware and document carefully".)
Plus the names keep changing. Have I mentioned that? The names keep changing.

These issues don't mean that I think you should never use taxonomy: there's still utility in the assumptions you can make with taxonomic assignments. It just means that if you're looking for taxonomy to save you from noise, you might be sorely disappointed and may not solve all your intended problems.

Personally, I like to work at an ASV or OTU level, because there are a lot of interesting things that happen within genera and smaller clades. There are a lot of examples of niche competition between closely related species, or where specific species/strains are related to an outcome of interest. (I recently co-authored a paper showing a single nucleotide difference in an ASV drove community cases and related to a cancer diagnosis. It's not the only example, but it's one.) I think if you want to make the collapsed assumption, you can, but you need to know that you're missing information.

You probably could come up with a tree built from the taxonomic hierarchy, but because taxonomy ≠ phylogeny, it likely won't be as good. In an ideal world, you could, but not right now. So, you can either choose to work on an uncollapsed table or stick to non-phylogenetic metrics. Both are good choices.
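As a rough illustration of the "non-phylogenetic metrics" option, here is a minimal sketch of Bray-Curtis dissimilarity, which needs only the counts and no tree, so it works on a genus-collapsed table. The sample names and counts are made up for illustration, and in practice you would use your pipeline's own diversity tooling rather than this hand-rolled function.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors.

    Both vectors must list the same features in the same order.
    Returns 0.0 for identical communities and 1.0 for no shared features.
    """
    if len(u) != len(v):
        raise ValueError("count vectors must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        # Two empty samples: treat them as identical.
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / total


# Hypothetical genus-level counts (three genera, three samples).
genus_table = {
    "sample_A": [10, 0, 5],
    "sample_B": [0, 10, 5],
    "sample_C": [10, 0, 5],
}

# Pairwise dissimilarities, computed without any phylogenetic tree.
samples = sorted(genus_table)
for i, s1 in enumerate(samples):
    for s2 in samples[i + 1:]:
        print(s1, s2, round(bray_curtis(genus_table[s1], genus_table[s2]), 3))
```

The trade-off is the one described above: this metric treats every genus as equally different from every other genus, which is exactly the information a tree would have added.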

For de novo processes, they're really different steps: you can't map taxonomy until you have representative sequences. Closed reference OTUs are assigned taxonomy in the clustering step: they inherit the label of their cluster.
Tutorials are often written with pedagogical goals or a planned workflow in mind; the order presented in a tutorial may or may not be the order in which people actually work. (Personally, I tend to assign taxonomy once I have representative sequences, since I view it as a processing step, but there are as many views on this as there are analysts.)

I hope this helps.

Best,
Justine

SunScript0 · May 23, 2022, 2:17pm

Thanks for the detailed answer! I have decided to switch to working on ASV tables, at the very least while I'm looking at alpha and beta diversity. I guess I can only see benefits from doing that.
Later, when I start doing differential abundance, I will see how I feel about ASV tables in that context. Other than the concern about interpretation, I think with ASVs there is a concern about sparsity and multiple testing. I'll see what comes up as differential and go from there.
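To make the multiple-testing concern concrete, here is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python. It is only an illustration under assumed inputs: the p-values are invented, and a real differential abundance tool would handle the correction (and the sparsity) for you.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values: a few genus-level tests versus many ASV-level tests.
genus_pvalues = [0.01, 0.04, 0.03, 0.20]
asv_pvalues = [0.01] + [1.0] * 39  # same smallest p-value, 40 features tested

print(benjamini_hochberg(genus_pvalues))
print(benjamini_hochberg(asv_pvalues)[0])
```

The same raw p-value ends up with a larger adjusted value when it is one of 40 tests than when it is one of 4, which is the cost of testing every ASV separately.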

jwdebelius · May 23, 2022, 8:27pm

Hi @SunScript0,

It is common to collapse data to the genus level with ASVs.
It's also an option to filter them, and only focus on more abundant features (for whatever definition of abundant you choose.)
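Both options can be sketched in a few lines of plain Python. This is a minimal illustration, not a QIIME 2 workflow: the table layout, the ASV-to-genus mapping, and the `min_total` threshold are all assumptions made up for the example.

```python
from collections import defaultdict


def collapse_to_genus(asv_counts, asv_to_genus):
    """Sum ASV counts within each genus, per sample."""
    collapsed = {}
    for sample, counts in asv_counts.items():
        genus_counts = defaultdict(int)
        for asv, n in counts.items():
            # ASVs without a genus label are pooled together.
            genus_counts[asv_to_genus.get(asv, "Unassigned")] += n
        collapsed[sample] = dict(genus_counts)
    return collapsed


def filter_by_total_count(table, min_total):
    """Keep features whose count summed over all samples is >= min_total."""
    totals = defaultdict(int)
    for counts in table.values():
        for feature, n in counts.items():
            totals[feature] += n
    keep = {f for f, t in totals.items() if t >= min_total}
    return {
        sample: {f: n for f, n in counts.items() if f in keep}
        for sample, counts in table.items()
    }


# Hypothetical ASV table (sample -> ASV -> count) and taxonomy mapping.
asv_counts = {
    "s1": {"asv1": 5, "asv2": 3, "asv3": 0},
    "s2": {"asv1": 2, "asv2": 0, "asv3": 7},
}
asv_to_genus = {"asv1": "Bacteroides", "asv2": "Bacteroides", "asv3": "Prevotella"}

genus_table = collapse_to_genus(asv_counts, asv_to_genus)
abundant_asvs = filter_by_total_count(asv_counts, min_total=5)
print(genus_table)
print(abundant_asvs)
```

Total count is only one possible definition of "abundant"; prevalence across samples or relative abundance would work the same way with a different rule inside the filter.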

Best,
Justine