Issues associated with binning features by taxonomic assignment


Thanks for providing these tools for analyzing compositional data - they’ve been very helpful for the analysis of our environmental DNA data looking at the COI gene.

As is mentioned in the “Help understanding DEICODE” post from March 2019, it is suggested not to bin features by taxonomic assignment before running DEICODE. However, for a project I am working on, we are most interested in the species that are causing our samples to cluster into two groups on our DEICODE PCA, not the different ASVs that make up those species. We’ve run the PCA both ways (with the unmerged ASVs, and the ASVs binned at the species level), and it isn’t significantly affecting the clustering of our samples (see attached).

Would you be able to provide some more information regarding why you advise against binning by taxonomic assignment before running DEICODE? If the DEICODE RPCA is giving consistent results for unmerged ASVs and ASVs binned by species, would you take this as justification for visualizing PC loadings at the species level rather than at the ASV level to simplify this analysis?

Thanks in advance for your response!



Hello Markus,

Welcome to the forums! :qiime2:

First off, let's get @cmartino's advice on this topic.

You mentioned:

Trying it both ways is good!

Also good! This means that similar trends emerge regardless of technical / preprocessing methods.
(When changing minor methods modifies major results, yikes! :scream_cat: )

Until Cameron has time to provide feedback on feature collapsing, I can comment briefly on why it can be better to use "the different ASVs that make up those species" instead of collapsed species for all sorts of analysis.

There are three major reasons I choose to use features without merging by taxonomy:

  1. Merging can muddle your signal. ASVs that are the same species but respond differently will get merged, and this signal will be lost.
  2. The database and classifier will introduce some bias. ASVs that only have a hit at the family level in the database will get merged, just like ASVs that could only be classified down to the family level. Failings of the database or classifier will also muddle your signal.
  3. Why waste it? ASVs can offer resolution below the species level. That's what makes them awesome!
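To make the first point concrete, here is a minimal sketch with made-up counts (the ASV names, the group labels, and the numbers are all hypothetical). Two ASVs assigned to the same species respond in opposite directions between two sample groups, and summing them erases the difference:

```python
from statistics import mean

# Hypothetical data: six samples in two groups, three ASVs.
# asv_a and asv_b are assumed to carry the same species label.
groups = ["A", "A", "A", "B", "B", "B"]
asv_counts = {
    "asv_a": [90, 80, 85, 10, 15, 20],
    "asv_b": [10, 20, 15, 90, 85, 80],
    "asv_c": [50, 50, 50, 50, 50, 50],
}


def mean_by_group(values, groups):
    """Return {group: mean of the values belonging to that group}."""
    return {
        g: mean(v for v, label in zip(values, groups) if label == g)
        for g in sorted(set(groups))
    }


# Per-ASV view: asv_a and asv_b clearly differ between groups A and B.
per_asv = {name: mean_by_group(vals, groups) for name, vals in asv_counts.items()}

# Merged view: summing the two ASVs into one species-level feature.
species_x = [a + b for a, b in zip(asv_counts["asv_a"], asv_counts["asv_b"])]
merged = mean_by_group(species_x, groups)

print(per_asv["asv_a"])  # group means 85 vs 15
print(per_asv["asv_b"])  # group means 15 vs 85
print(merged)            # group means 100 vs 100: the signal is gone
```

Real data will rarely cancel this cleanly, but partial cancellation works the same way.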

For balance, here are three reasons I have previously merged features based on taxonomy:

  1. Legibility. I can't make a bar plot of 2k ASVs and label all of them.
  2. Ease of writing. The PI finds it easier to discuss species, rather than "ASVs classified as species."
  3. Reviewer 3 prefers merged species to ASVs. What can ya' do :man_shrugging: :woman_shrugging:

This does not take into account DEICODE-specific reasons not to merge. I'm also interested in feedback from Cameron.



Hi @markusmin,

Thanks for using DEICODE and RPCA! I think @colinbrislawn did a great job summarizing some of the benefits of not collapsing by the lowest common ancestor (LCA). I have found that tables summed by LCA group can behave unpredictably. In this case, it seems to be okay, but in others it may not be (there are likely many reasons for this). One way to link the taxonomy is to add arrows colored by taxonomic groups to the RPCA plot using the biplot command here.

I usually suggest (as @colinbrislawn pointed out) to color or group ASVs by taxonomies downstream of dimensionality reduction or differential abundance. In fact, this is what Qurro is built for. Qurro can interactively group by taxonomic levels in a log-ratio based on what ASVs are separating your biplot from DEICODE. There are some great tutorials for Qurro using DEICODE, ALDEx2, and Songbird here. This should help link the subject groupings to the taxonomic clades driving the separation.
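As a rough illustration of grouping by taxonomy in a log-ratio, here is a small sketch in plain Python. It is not Qurro's code or API. The feature names, taxonomy strings, counts, and helper functions are all hypothetical, and the log-ratio is taken as the natural log of summed numerator counts over summed denominator counts:

```python
import math

# Hypothetical taxonomy labels for four ASVs.
taxonomy = {
    "asv_1": "g__GenusA; s__species_one",
    "asv_2": "g__GenusA; s__species_one",
    "asv_3": "g__GenusB; s__species_two",
    "asv_4": "g__GenusB; s__species_two",
}

# Hypothetical counts: sample -> feature -> count.
table = {
    "s1": {"asv_1": 40, "asv_2": 60, "asv_3": 10, "asv_4": 10},
    "s2": {"asv_1": 5, "asv_2": 5, "asv_3": 50, "asv_4": 30},
}


def features_matching(taxonomy, text):
    """Select features whose taxonomy label contains the given text."""
    return [f for f, label in taxonomy.items() if text in label]


def log_ratio(sample_counts, numerator, denominator):
    """ln(sum of numerator counts / sum of denominator counts) for one sample.

    Returns None when either sum is zero, since the ratio is undefined.
    """
    num = sum(sample_counts.get(f, 0) for f in numerator)
    den = sum(sample_counts.get(f, 0) for f in denominator)
    if num == 0 or den == 0:
        return None
    return math.log(num / den)


numerator = features_matching(taxonomy, "species_one")
denominator = features_matching(taxonomy, "species_two")
ratios = {s: log_ratio(c, numerator, denominator) for s, c in table.items()}
print(ratios)
```

The ASVs stay separate in the table, and taxonomy is only used afterwards to decide which ASVs go on each side of the ratio.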


Hi @cmartino and @colinbrislawn,

Thank you for your detailed responses!

My reasons for joining ASVs by LCA are very similar to the reasons @colinbrislawn listed for merging features based on taxonomy:

  1. We have over 1,000 ASVs, and plotting each of them individually makes for a monster of a plot.
  2. The PI wants to know what species are driving differences, and having a dozen different ASVs for the same species makes the answer to this question more complicated, thus requiring a lot more explanation.

I’ve found Qurro to be quite useful for looking at the log-ratios of our different taxa, but for visualizing our taxa for a figure, there were just too many ASVs to be able to distinguish any clear patterns - hence our desire to merge by taxonomy to simplify this visualization.

I’m happy to hear that it sounds like in our case, binning ASVs at the species level before running DEICODE is okay. @cmartino, would you be able to provide more general information about how merging by LCA can cause DEICODE to give odd or unpredictable results? I ask because moving forward, we plan on using DEICODE for other datasets, and it would be very useful to know more about why merging by LCA can go wrong (and what to watch out for).




Hi @markusmin,

Have you considered filtering your data to limit your analysis to taxa present in a certain proportion of samples or above a certain abundance? (I would check the Aitchison distance before and after to make sure they remain correlated.) That way, you’re limiting yourself to ASVs present in several samples at a certain abundance threshold, which are hopefully powered enough to be worthwhile. I’ve found it helps a lot with the tractability of a lot of my analyses.
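A dependency-free sketch of that before/after check might look like the following. The count table is made up, and the pseudocount of 1 is an assumption of this sketch (DEICODE itself handles zeros differently). The correlation here is a plain Pearson correlation between the two sets of pairwise distances, without the permutation step a Mantel test would add:

```python
import math
from itertools import combinations


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample (pseudocount is an assumption)."""
    logs = [math.log(c + pseudocount) for c in counts]
    center = sum(logs) / len(logs)
    return [x - center for x in logs]


def aitchison_distances(table, pseudocount=1.0):
    """Pairwise Aitchison distances: Euclidean distance between CLR vectors."""
    transformed = [clr(row, pseudocount) for row in table]
    return [math.dist(a, b) for a, b in combinations(transformed, 2)]


def prevalence_filter(table, min_fraction):
    """Keep feature columns that are non-zero in at least min_fraction of samples."""
    n_samples = len(table)
    keep = [
        j for j in range(len(table[0]))
        if sum(1 for row in table if row[j] > 0) / n_samples >= min_fraction
    ]
    return [[row[j] for j in keep] for row in table], keep


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical table: rows are samples, columns are ASVs.
table = [
    [120, 30, 0, 5, 0],
    [100, 45, 0, 0, 2],
    [10, 200, 3, 0, 0],
    [15, 180, 0, 1, 0],
]

filtered, kept = prevalence_filter(table, min_fraction=0.5)
r = pearson(aitchison_distances(table), aitchison_distances(filtered))
print(kept, round(r, 3))
```

If the correlation drops sharply after filtering, the filter is probably removing features that matter for the sample-to-sample structure.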

I’m also going to strongly suggest that your PI is wrong and that you should work with ASVs. As both @colinbrislawn and @cmartino have said, collapsing removes resolution and obfuscates patterns. As a very real example, I did an analysis recently where a pair of microbes that differed by a single nucleotide explained a huge amount of variation in my community structure. While it’s an extreme example, it’s definitely not a one-off in my experience. You also have the problem that, frankly, species-level resolution sucks in most databases, it’s unreliable if you’re working with short reads, and it’s even a weird idea for a subset of the tree of life where genetic material can be shared on mobile genes as well as through direct inheritance.



Hi @jwdebelius,

Thank you for your helpful tips! I’ve been constraining my analysis to only those ASVs that appear in at least 25% of our samples, but perhaps that isn’t a high enough threshold. If I increased that percentage or added a condition of abundance, that could be another way to reduce the number of ASVs plotted without having to merge by taxonomy.





It’s hard to recommend my favorite approach because it isn’t implemented in QIIME (it’s on my quarantine list), but I tend to like to do a joint filter where “present” is defined based on some threshold (typically tied to my rarefaction depth), and then I filter for a certain prevalence. I would still maybe consider low-abundance taxa. Depending on my sample size, this filtering typically drops me to about 200 ASVs.
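For anyone who wants to try the two-step idea outside QIIME, here is a minimal sketch. The function name, the table, the rarefaction depth of 5000, and the rule of thumb tying the presence threshold to 1/1000 of that depth are all assumptions for illustration, not a recommended setting:

```python
def joint_filter(table, min_count, min_prevalence):
    """Return features counted as present in enough samples.

    table: dict of sample -> {feature: count}.
    A feature is "present" in a sample when its count is >= min_count.
    It is kept when it is present in at least min_prevalence (a fraction)
    of all samples.
    """
    n_samples = len(table)
    features = sorted({f for counts in table.values() for f in counts})
    kept = []
    for f in features:
        n_present = sum(
            1 for counts in table.values() if counts.get(f, 0) >= min_count
        )
        if n_present / n_samples >= min_prevalence:
            kept.append(f)
    return kept


# Hypothetical rarefied table.
table = {
    "s1": {"asv_1": 50, "asv_2": 4, "asv_3": 0},
    "s2": {"asv_1": 30, "asv_2": 6, "asv_3": 1},
    "s3": {"asv_1": 0, "asv_2": 3, "asv_3": 2},
    "s4": {"asv_1": 12, "asv_2": 2, "asv_3": 0},
}

rarefaction_depth = 5000               # assumed depth
min_count = rarefaction_depth // 1000  # assumed rule of thumb: 5 reads

strict = joint_filter(table, min_count=min_count, min_prevalence=0.5)
loose = joint_filter(table, min_count=1, min_prevalence=0.5)
print(strict)  # ['asv_1']
print(loose)   # ['asv_1', 'asv_2', 'asv_3']
```

With presence defined as any non-zero count, all three ASVs pass the 50% prevalence cut. Requiring at least 5 reads to count as present leaves only one.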

Then, you might consider a secondary filtering for display. Are there specific organisms that matter (those that separate along PCs, for example)?
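One simple way to pick features for display, sketched below, is to rank them by the absolute value of their loading on a chosen axis and keep the top few. The loadings here are invented numbers, and the function is a hypothetical helper rather than part of DEICODE or QIIME:

```python
def top_features_by_loading(loadings, axis=0, k=3):
    """Return the k features with the largest absolute loading on one axis.

    loadings: dict of feature -> tuple of loadings, one value per PC.
    axis: 0 for the first PC, 1 for the second, and so on.
    """
    ranked = sorted(loadings, key=lambda f: abs(loadings[f][axis]), reverse=True)
    return ranked[:k]


# Hypothetical feature loadings on PC1 and PC2.
loadings = {
    "asv_1": (0.82, 0.05),
    "asv_2": (-0.74, 0.10),
    "asv_3": (0.03, 0.61),
    "asv_4": (0.01, -0.02),
    "asv_5": (-0.15, -0.55),
}

pc1_top = top_features_by_loading(loadings, axis=0, k=2)
pc2_top = top_features_by_loading(loadings, axis=1, k=2)
print(pc1_top)  # ['asv_1', 'asv_2']
print(pc2_top)  # ['asv_3', 'asv_5']
```

Using the absolute value keeps features from both ends of an axis, which is usually what you want when asking what separates two clusters.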