I understand this is a common question that has been discussed numerous times on different forums, both why this phenomenon happens and how to solve it (or ignore it), but it would be nice to hear from some of you researchers how you interpret it in your own research.
Data from ASV methods such as DADA2 commonly contain multiple ASVs classified to the same species, most likely the dominant species in the microbiome.
My question is: is it generally accepted to treat these ASVs as different features? Taxonomy collapse methods agglomerate these features, and it’s common to see abundance bar plots with agglomerated taxonomy. But on a closer look, some of these ASVs really do have considerably different sequences, and their assignment confidences from the naive Bayes classifier differ as well. The abundances of ASVs from the same species can also show opposing patterns across samples, so is agglomerating them in essence eliminating useful signal? Data before and after collapsing can give different PCA (or other statistical method) results as well. So what are your thoughts? If two ASVs are assigned to the same species, and ASV1 correlates positively with temperature while ASV2 correlates negatively, consistently across samples, is it reasonable to interpret them as different strains? And do you agglomerate the table, or use the huge original feature table for statistical tests and analysis?
I think it's an important and under-discussed issue, but certainly something we talk about a lot here.
To some degree it depends on the context and the audience, but IMO, you need to treat them as different features. I think for a lot of people, it's uncomfortable to think about the same "species" having multiple ASVs, especially coming from OTUs, where a 97% OTU was supposed to be a species proxy. (I don't fully remember what the reference was for this, but that was the assumption.) So, you're supposed to have a somewhat unique OTU for each species. That said, our database resolution also isn't perfect, OTU-based clustering has some major issues, and taxonomy, phylogeny, and naming are always more complex than people are comfortable with. So, for people who need names to cling to (sometimes like a ), the idea of treating the individual ASVs as their own features becomes very uncomfortable. And I spend a long time having this discussion, often unsuccessfully. That doesn't mean either approach is free of bias, nor that you can't do complementary work.
...I feel like I should mention another specific personal bias: I hate species-level designations in microbiome data, unless there's a cultured or isolated organism. There are very few databases where I trust species-level designations, so if I aggregate, I aggregate at genus level anyway.
A bar plot is a great visualization for focusing on the most abundant organisms, and because of the way our brains process information, it makes sense to collapse things. It doesn't necessarily mean that you should be doing your statistics on that qualitative presentation (or drawing sweeping conclusions!). So, yeah, I'm all for a collapse here!
I'm also 100% with you on this. I recently did an analysis where we found opposing behavior in closely related ASVs. Collapsing these would have lost what ended up being a major signal in my data. Maintaining that separation can end up being really important. There is a drawback: if there are multiple closely related ASVs that are rare but have the same behavior, failure to aggregate might mean that you lose power. So, to some degree, you need to figure out where you are willing to risk potential false negatives.
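To make the "opposing behavior" point concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the temperatures, the counts for the two ASVs, and the `pearson` helper are assumptions, not data or code from a real analysis. It only shows how summing two features with opposite trends can flatten the association that each one carries on its own.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: one temperature per sample, and counts for two ASVs
# that share a taxonomic label but respond in opposite directions.
temperature = [10, 15, 20, 25, 30]
asv1 = [5, 12, 20, 31, 40]   # increases with temperature
asv2 = [42, 30, 22, 10, 4]   # decreases with temperature

# Collapsing by taxonomy amounts to summing the two features per sample.
collapsed = [a + b for a, b in zip(asv1, asv2)]

r_asv1 = pearson(temperature, asv1)
r_asv2 = pearson(temperature, asv2)
r_collapsed = pearson(temperature, collapsed)

print(f"ASV1 vs temperature:      r = {r_asv1:.2f}")
print(f"ASV2 vs temperature:      r = {r_asv2:.2f}")
print(f"Collapsed vs temperature: r = {r_collapsed:.2f}")
```

With these toy numbers the two ASVs are strongly correlated with temperature in opposite directions, while the collapsed feature shows a much weaker association. Real count data would usually need normalization or a compositionally aware method before correlating, which this sketch skips.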
I treat them as different features, and I have started labeling them as "strain-level" variation if I can confirm they belong to the same species (i.e. cultured organism with a clear species definition.) Otherwise, they're co-excluding ASVs from the same genus. And, I just present an ASV name and use it.
I like to work on the ASV table. For diversity, I work on a rarefied full table but prefer phylogenetic metrics, since these to some degree account for similarity between the ASVs. And then, for feature-based analyses I tend to work on a filtered ASV table where I've removed low-abundance features. My experience has been that when I collapse, it's almost always a single ASV which drives higher-level differences (and often I lose those differences when I collapse my data).
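As a rough illustration of what "removing low-abundance features" can look like, here is a small sketch in plain Python. The table, the function name `filter_low_abundance`, and both thresholds (`min_total`, `min_samples`) are hypothetical; sensible cutoffs depend on sequencing depth and study design, and this is not a specific tool's implementation.

```python
# Hypothetical ASV table: feature ID -> read counts per sample.
asv_table = {
    "ASV1": [120, 98, 143, 110],
    "ASV2": [0, 3, 0, 0],
    "ASV3": [15, 0, 22, 9],
    "ASV4": [1, 0, 0, 1],
}

def filter_low_abundance(table, min_total=10, min_samples=2):
    """Keep features with at least `min_total` reads overall that are
    present (count > 0) in at least `min_samples` samples.
    Both thresholds are illustrative assumptions."""
    kept = {}
    for feature, counts in table.items():
        total = sum(counts)
        prevalence = sum(1 for c in counts if c > 0)
        if total >= min_total and prevalence >= min_samples:
            kept[feature] = counts
    return kept

filtered = filter_low_abundance(asv_table)
print(sorted(filtered))
```

Here ASV2 and ASV4 are dropped for having too few reads overall, while the ASV-level resolution of the remaining features is left intact.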
However, you can also always run ASV-level and collapsed analyses in parallel and see if they agree! If you're working with a non-phylogenetic metric like Bray-Curtis, you may actually learn a lot by doing this!
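A toy version of that side-by-side comparison, in plain Python, might look like the sketch below. The two samples, the taxonomy labels, and the helper names `bray_curtis` and `collapse` are all hypothetical, and the sketch assumes both samples contain the same set of ASVs. Bray-Curtis is computed as the sum of absolute count differences divided by the sum of all counts.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator

def collapse(sample, taxonomy):
    """Sum ASV counts within each taxon label; return counts in sorted taxon order."""
    totals = {}
    for asv, count in sample.items():
        taxon = taxonomy[asv]
        totals[taxon] = totals.get(taxon, 0) + count
    return [totals[taxon] for taxon in sorted(totals)]

# Hypothetical taxonomy: ASV1 and ASV2 share a genus-level label.
taxonomy = {"ASV1": "GenusA", "ASV2": "GenusA", "ASV3": "GenusB"}

# Hypothetical samples in which ASV1 and ASV2 swap dominance.
sample_x = {"ASV1": 40, "ASV2": 5, "ASV3": 55}
sample_y = {"ASV1": 5, "ASV2": 40, "ASV3": 55}

asv_order = sorted(taxonomy)
bc_asv = bray_curtis(
    [sample_x[a] for a in asv_order],
    [sample_y[a] for a in asv_order],
)
bc_collapsed = bray_curtis(collapse(sample_x, taxonomy), collapse(sample_y, taxonomy))

print(f"ASV-level Bray-Curtis: {bc_asv:.2f}")
print(f"Collapsed Bray-Curtis: {bc_collapsed:.2f}")
```

In this constructed case the samples look clearly different at the ASV level but identical once collapsed to genus, because the swap between ASV1 and ASV2 is invisible after summing. When the two levels disagree like this on real data, that disagreement is itself worth investigating.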
Wow, thank you @jwdebelius for the insights! Very helpful. I agree with you completely: treating different ASVs as different biological features makes a lot more sense. If the data is handled properly, it seems like there's huge room for interpretation and for filtering for useful signals.