Abbreviating taxonomic names in R

Hello! I have been having difficulty with taxonomic names in R and wanted to ask what are people's suggestions for this.

For context, I have been working with phyloseq objects made from qiime exports (feature table.qza, tree.qza, taxonomy.qza, and metadata). I then use the phyloseq objects for downstream analyses (e.g. differential abundance testing). I'm running into an issue with how taxonomic names appear in graphs (i.e., the full taxonomic name from phylum to species), and I want to be able to abbreviate them to a taxonomic rank of my choosing.

The created phyloseq object has an otu_table (which has the sequences and their counts for each sample) and a tax_table (which associates the sequences with the taxonomic names and each taxonomic rank has their own column). Should I try to merge the otu_table and tax_table together? My concern for this is that the combined table would lack metadata until I then merge metadata with it, which seems like it defeats the purpose of even generating a phyloseq object.

Should I be trying to split the taxonomic names first before running any downstream analyses? My issue with this is that some of the taxonomic ranks have the same delimiter within its rank as between ranks (e.g., family_genus_species_group10).

Since I do use ggplot2, it was suggested that I build a vector of the preferred taxonomic names and pass it as labels in the ggplot code, but I feel like this would be tedious to do every time and prone to error if you mislabel or misremember a name:

labels <- c("A", "B", "C")
scale_x_discrete(labels = labels)

I would appreciate any advice and suggestions on how to shorten/abbreviate taxonomic names when using phyloseq objects in R. Thank you! :cherry_blossom:

Hello @MBugay

This is an excellent question and a very common challenge when working with taxonomic data for visualizations! I agree that manually creating labels for every plot is tedious and error-prone, and it's best to use the phyloseq data structure as much as we can.

I guess my first suggestion is to store all code in a notebook, if you are not already. I use notebooks to track what has been run and try new things quickly! A good dry-lab notebook is essential, just like in the wet-lab!

I got started using RStudio because it supports R Markdown files.
I now use VS Code and have it set up to use my Qiime2 conda environments.

Here's what I would try: extract your taxonomy into a data frame, add your new custom labels there, and then merge this updated taxonomy table into a new phyloseq object.

Here's some example code (also on GitHub):

# Load required libraries
library(phyloseq)
library(tidyverse)

# Load example data
data(GlobalPatterns)
GlobalPatterns

# Let's use the GlobalPatterns phyloseq object

# Extract tax_table and convert it into a tibble for easy edits
taxa_df <- GlobalPatterns |>
  tax_table() |>
  as.data.frame() |> 
  # Keep full ASV IDs for merging later!
  as_tibble(rownames = "ASV_ID") 
taxa_df # view the tibble

# Create new, shorter taxa labels in new columns.
taxa_df_custom <- taxa_df |>
    mutate(
      # Find the most specific taxonomic rank available for a clean label
      MostSpecific = coalesce(Genus, Family, Order, Class, Phylum),
    
      # Create a unique version for plotting to prevent aggregation of
      # different ASVs that have the same taxonomy (e.g., multiple Lactobacillus)
      Taxa_unique = str_c(MostSpecific, " (", str_sub(ASV_ID, 1, 5), ")")
    )
taxa_df_custom # view the new columns

# Convert back into a phyloseq tax_table.
# We turn the ASV_IDs back into rownames for phyloseq!
new_tax_table <- taxa_df_custom |>
  column_to_rownames("ASV_ID") |>
  as.matrix() |>
  tax_table()

# Replace the old tax_table in your phyloseq object.
# The original GlobalPatterns object remains unchanged
GlobalPatternsNew <- GlobalPatterns 
tax_table(GlobalPatternsNew) <- new_tax_table

# Now, check for new taxonomy columns!
head(tax_table(GlobalPatternsNew))
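
Once the new columns are in the tax_table, they can be used directly as plot labels. The following is only a minimal sketch, assuming the same GlobalPatterns example data: it rebuilds the custom tax_table in condensed form, keeps the ten most abundant taxa (an arbitrary cutoff chosen to keep the plot readable), and uses phyloseq's psmelt() to get one long data frame that already holds counts, sample metadata, and taxonomy together. The names ps_new, ps_top, and plot_df are illustrative.

```r
library(phyloseq)
library(tidyverse)

data(GlobalPatterns)

# Rebuild the custom tax_table from the code above in condensed form
new_tax <- GlobalPatterns |>
  tax_table() |>
  as.data.frame() |>
  as_tibble(rownames = "ASV_ID") |>
  mutate(
    MostSpecific = coalesce(Genus, Family, Order, Class, Phylum),
    Taxa_unique = str_c(MostSpecific, " (", str_sub(ASV_ID, 1, 5), ")")
  ) |>
  column_to_rownames("ASV_ID") |>
  as.matrix() |>
  tax_table()

ps_new <- GlobalPatterns
tax_table(ps_new) <- new_tax

# Keep only the 10 most abundant taxa (arbitrary cutoff for this sketch)
top_taxa <- names(sort(taxa_sums(ps_new), decreasing = TRUE))[1:10]
ps_top <- prune_taxa(top_taxa, ps_new)

# psmelt() returns one row per taxon-sample pair, with the sample
# metadata and every tax_table column (including the new ones) attached
plot_df <- psmelt(ps_top)

# Use the short label instead of the full lineage
p <- ggplot(plot_df, aes(x = Sample, y = Abundance, fill = Taxa_unique)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
p
```

Because psmelt() carries the metadata along, there is no need to merge the otu_table, tax_table, and metadata by hand.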

Hi @colinbrislawn

Thank you so much for the prompt and detailed response! To clarify, I am actually working in RStudio, not R. I mistyped earlier.

I didn't realize we could update and merge a taxonomy table into ps_obj. That makes a lot of sense now.

I will plan on trying your suggestion this week and will report back later. Again, thank you for your help.

An easy alternative is to use the microbiomeX (mbX) package.

To do this, you’ll need:

  • the level-7.csv file you can download from [view.qiime2 website] when you create barplots, and
  • the same metadata.txt file that you used in QIIME2.

You can clean your data at any taxonomic level. For example, to clean at the genus level:

Installation:

install.packages("mbX")
library(mbX)

Code run:

ezclean("level-7.csv", "metadata.txt", "g")

Easy use guide: mbX-R-package/easy_tutorial_mbX.pdf at main · utsavlamichhane/mbX-R-package · GitHub
All functions: mbX_0.2.0_official_CRAN_version/Functions_test.md at main · utsavlamichhane/mbX_0.2.0_official_CRAN_version · GitHub

Hello @colinbrislawn

Okay, I was able to follow your suggestion and replace the old tax_table.

However, when it comes to actually plotting results, I am still figuring it out. For instance, with ancombc2 results, you can generate a results dataframe that has taxon (e.g., Family, Genus, etc) as one of the columns, depending on what taxonomic level you set it to. You can then take this results dataframe and filter it for your significant results, which you can then plot. An issue is that the results dataframe does not include any of the new columns (MostSpecific, Taxa_unique) from the updated phyloseq object. I understand why this happens, but, because the results dataframe and taxa custom dataframe do not share any columns, I cannot merge these two dataframes together. Does that make sense?

Yeah, unfortunately it does make sense. Phyloseq and ancombc2 use different data structures, so converting from one to another or merging their outputs takes some work.

You could start here. Maybe if you ran ANCOM-BC2 on the updated phyloseq object that had these new columns?

There may be a good way to do this using a feature built into ANCOM-BC! You may have found this already, but here's the GitHub, just in case: GitHub - FrederickHuangLin/ANCOMBC: Differential abundance (DA) and correlation analyses for microbial absolute abundance data

> You could start here. Maybe if you ran ANCOM-BC2 on the updated phyloseq object that had these new columns?

I'm a little confused by what you mean here. Could you please elaborate?

To clarify, I did run ANCOMBC2 on the updated phyloseq object that had the new columns. The ANCOMBC2 results, unfortunately, do not include these new columns as it bases the taxon name off of whatever is the selected taxonomic level (e.g., if I select Family as the taxonomic level, then it will only select whatever is in the Family column).

This is what I was suggesting! I'm sorry that it didn't help...

I would have to dig into the ANCOMBC2 results data structures to see what to try next...

@utsav_Lamichhane, is this something you would be interested in working on?

Thank you for your suggestion, but, if possible, I would like to continue working with phyloseq. From my understanding of the mbX package, this would take the place of phyloseq and any analysis with the phyloseq object?

I'm looking at the ANCOMBC2 results again and many of the taxon names are shortened appropriately for their taxonomic level (e.g., Family level = Abditibacteriaceae), but then there are some that are not (e.g., Family level = d__Bacteria_Acidobacteriota_Acidobacteriae_Acidobacteriales_uncultured).

For the rows where the names are not shortened, I wonder if I should try something similar to what you suggested before and separate them out. Maybe str_extract for this part?

Unfortunately, I do not think I could attach a unique ASV_ID to the taxon, as the ASV_ID sequence is also not included in the ANCOMBC2 results, but maybe I could attach "unidentified Family" to those that were not abbreviated at that level?
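
A minimal sketch of that idea, using only the two example names above as toy data. It assumes that un-shortened names always start with "d__" and that ranks are joined by single underscores, which will not hold if a rank name itself contains an underscore. The column names is_full_lineage, parent_rank, and Label are illustrative.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the 'taxon' column of Family-level ANCOMBC2 results
res_df <- tibble(
  taxon = c(
    "Abditibacteriaceae",
    "d__Bacteria_Acidobacteriota_Acidobacteriae_Acidobacteriales_uncultured"
  )
)

res_labeled <- res_df |>
  mutate(
    # Rows that still carry the full lineage start with the domain prefix
    is_full_lineage = str_detect(taxon, "^d__"),
    # Grab the second-to-last underscore-separated piece (the parent rank);
    # this is NA for names without an underscore
    parent_rank = str_match(taxon, "([^_]+)_[^_]+$")[, 2],
    # Keep clean names as they are, flag the rest as unidentified
    Label = if_else(
      is_full_lineage,
      str_c(parent_rank, " (unidentified family)"),
      taxon
    )
  )

res_labeled$Label
```

Keeping the parent rank in the label stops every unresolved family from collapsing into one identical "unidentified Family" entry on the plot.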

> # Create new, shorter taxa labels in new columns.
> taxa_df_custom <- taxa_df |>
>     mutate(
>       # Find the most specific taxonomic rank available for a clean label
>       MostSpecific = coalesce(Genus, Family, Order, Class, Phylum),
>     
>       # Create a unique version for plotting to prevent aggregation of
>       # different ASVs that have the same taxonomy (e.g., multiple Lactobacillus)
>       Taxa_unique = str_c(MostSpecific, " (", str_sub(ASV_ID, 1, 5), ")")
>     )
> taxa_df_custom # view the new column

@colinbrislawn Yes, happy to help dig in.

@colinbrislawn

Here's a good update! I recently spoke to a peer and they actually run ANCOMBC2 without specifying the taxonomic level, which generates results that DO include the ASV sequences as the taxon name. Using your suggestion of the taxa_df_custom, I was able to merge the two dataframes since taxon (i.e., ASV sequences) and ASV_ID are now shared between the two dataframes.

I then used ggplot with this now combined dataframe and was able to label with the Taxa_unique column. I followed the example in the ANCOMBC2 tutorial for filtering the dataframe and using ggplot.

# ancombc2 will be performed at the lowest level of the input data,
# i.e., it should have the ASV sequences as the taxon
# (x, A, and B are placeholders for the phyloseq object and metadata variables)
ancom1 <- ancombc2(
  data = x,
  tax_level = NULL,
  fix_formula = "A + B"  # fix_formula takes a character string
)

# Create a dataframe of the ANCOMBC2 results
ancom1_df <- ancom1$res

# Merge the ANCOMBC2 results dataframe with the custom taxa dataframe
ancom1_df2 <- merge.data.frame(
  ancom1_df,
  taxa_df_custom,
  by.x = "taxon",
  by.y = "ASV_ID"
)
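
For the plotting step, here is a rough sketch of filtering the merged dataframe and labelling with Taxa_unique. It runs on a small toy dataframe rather than real results, and the column names lfc_groupB and diff_groupB are hypothetical: the actual names in the ANCOMBC2 output depend on the variables in your fix_formula, so check colnames() on your own results first.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the merged dataframe (ancom1_df2 above).
# 'lfc_groupB' and 'diff_groupB' are hypothetical column names, and the
# sequences and values are made up for illustration only.
merged_df <- tibble(
  taxon = c("ACGTAGGCTA", "TTGCAAGGCT", "GGCATTACGA"),
  Taxa_unique = c(
    "Lactobacillus (ACGTA)",
    "Bacteroides (TTGCA)",
    "Prevotella (GGCAT)"
  ),
  lfc_groupB = c(1.8, -0.9, 0.1),
  diff_groupB = c(TRUE, TRUE, FALSE)
)

# Keep only the taxa flagged as differentially abundant
sig_df <- merged_df |>
  filter(diff_groupB) |>
  mutate(direction = if_else(lfc_groupB > 0, "Positive LFC", "Negative LFC"))

# Horizontal bar plot labelled with the short, unique taxon names
p <- ggplot(
  sig_df,
  aes(x = lfc_groupB, y = reorder(Taxa_unique, lfc_groupB), fill = direction)
) +
  geom_col() +
  labs(x = "Log fold change", y = NULL, fill = NULL)
p
```

Since the labels come from a column in the dataframe, there is no separate label vector to keep in sync by hand.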