Abbreviating taxonomic names in R

Hello! I have been having difficulty with taxonomic names in R and wanted to ask for suggestions.

For context, I have been working with phyloseq objects made from QIIME 2 exports (feature table.qza, tree.qza, taxonomy.qza, and metadata). I then use the phyloseq objects for downstream analyses (e.g. differential abundance testing). I'm running into an issue with how taxonomic names appear in graphs (i.e., the full taxonomic name from phylum to species), and I want to be able to abbreviate them to a taxonomic rank of my choosing.

The created phyloseq object has an otu_table (the sequences and their counts for each sample) and a tax_table (which maps each sequence to its taxonomy, with each taxonomic rank in its own column). Should I try to merge the otu_table and tax_table together? My concern is that the combined table would lack metadata until I merge the metadata in as well, which seems to defeat the purpose of generating a phyloseq object in the first place.

Should I be trying to split the taxonomic names before running any downstream analyses? My issue with this is that some taxonomic ranks contain the same delimiter within the rank as is used between ranks (e.g., family_genus_species_group10).

Since I use ggplot2, it was suggested that I define a vector of preferred taxonomic names and pass it as labels in the ggplot code, but I feel this would be tedious to do every time and prone to error if a label is mistyped or put in the wrong order:

labels <- c("A", "B", "C")
scale_x_discrete(labels = labels)

I would appreciate any advice and suggestions on how to shorten/abbreviate taxonomic names when using phyloseq objects in R. Thank you! :cherry_blossom:


Hello @MBugay

This is an excellent question and a very common challenge when working with taxonomic data for visualizations! I agree that manually creating labels for every plot is tedious and error-prone, and it's best to use the phyloseq data structure as much as we can.

My first suggestion is to keep all your code in a notebook, if you aren't already. I use notebooks to track what has been run and to try new things quickly. A good dry-lab notebook is essential, just like in the wet lab!

I got started with RStudio because it supports R Markdown files.
I now use VS Code and have it set up to use my QIIME 2 conda environments.

Here's what I would try: extract your taxonomy into a data frame, add your new custom labels there, and then merge this updated taxonomy table into a new phyloseq object.

Here's some example code (also on GitHub):

# Load required libraries
library(phyloseq)
library(tidyverse)

# Load example data: we'll use the GlobalPatterns phyloseq object
data(GlobalPatterns)
GlobalPatterns # print a summary of the object

# Extract tax_table and convert it into a tibble for easy edits
taxa_df <- GlobalPatterns |>
  tax_table() |>
  as.data.frame() |> 
  # Keep full ASV IDs for merging later!
  as_tibble(rownames = "ASV_ID") 
taxa_df # view the tibble

# Create new, shorter taxa labels in new columns.
taxa_df_custom <- taxa_df |>
  mutate(
    # Find the most specific taxonomic rank available for a clean label
    MostSpecific = coalesce(Genus, Family, Order, Class, Phylum),

    # Create a more specific version for plotting to prevent aggregation of
    # different ASVs that have the same taxonomy (e.g., multiple Lactobacillus).
    # Note: a 5-character ID prefix is short but not guaranteed to be unique.
    Taxa_unique = str_c(MostSpecific, " (", str_sub(ASV_ID, 1, 5), ")")
  )
taxa_df_custom # view the new columns

# Convert back into a phyloseq tax_table.
# We turn the ASV_IDs back into rownames for phyloseq!
new_tax_table <- taxa_df_custom |>
  column_to_rownames("ASV_ID") |>
  as.matrix() |>
  tax_table()

# Replace the old tax_table in a copy of the phyloseq object,
# so the original 'GlobalPatterns' remains unchanged.
GlobalPatternsNew <- GlobalPatterns 
tax_table(GlobalPatternsNew) <- new_tax_table

# Now, check for new taxonomy columns!
head(tax_table(GlobalPatternsNew))
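
To connect this back to the plotting question, here is a minimal sketch of using a short label column directly as a ggplot2 aesthetic, so no hand-written `labels` vector is needed. It rebuilds a `MostSpecific` column in base R so that it runs on its own. The 20-taxa cutoff and the names `ps`, `ps_top`, and `plot_df` are assumptions for illustration only.

```r
library(phyloseq)
library(ggplot2)

data(GlobalPatterns)

# Pull the taxonomy out as a plain character matrix
tax <- as(tax_table(GlobalPatterns), "matrix")

# Take the most specific non-missing rank, checking Genus first
ranks <- c("Genus", "Family", "Order", "Class", "Phylum")
most_specific <- rep(NA_character_, nrow(tax))
for (r in ranks) {
  most_specific <- ifelse(is.na(most_specific), tax[, r], most_specific)
}

# Add the label as an extra tax_table column on a copy of the object
ps <- GlobalPatterns
tax_table(ps) <- tax_table(cbind(tax, MostSpecific = most_specific))

# Keep the 20 most abundant taxa so the legend stays readable (arbitrary cutoff)
top_taxa <- names(sort(taxa_sums(ps), decreasing = TRUE))[1:20]
ps_top <- prune_taxa(top_taxa, ps)

# psmelt() gives one long data frame: counts + sample data + tax_table columns
plot_df <- psmelt(ps_top)

p <- ggplot(plot_df, aes(x = Sample, y = Abundance, fill = MostSpecific)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
p
```

Because the label lives in the tax_table, it travels with the phyloseq object through subsetting and `psmelt()`, so every plot can refer to the same column by name.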

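
On the delimiter question: if your taxonomy arrives as one string per feature, one option is to split into a fixed number of ranks and let anything left over stay in the last column. Below is a rough sketch with `tidyr::separate()`. The data frame `taxa_raw`, its column names, and the example strings are made up, so adjust the ranks and separator to match your data.

```r
library(tidyr)

# Hypothetical input: one underscore-delimited taxonomy string per feature
taxa_raw <- data.frame(
  ASV_ID = c("asv1", "asv2", "asv3"),
  Taxon  = c("Lachnospiraceae_Blautia_obeum_group10",
             "Lactobacillaceae_Lactobacillus_reuteri",
             "Bacteroidaceae_Bacteroides")
)

taxa_split <- separate(
  taxa_raw, Taxon,
  into  = c("Family", "Genus", "Species"),
  sep   = "_",
  extra = "merge", # extra pieces stay glued onto the last column
  fill  = "right"  # missing ranks on the right become NA
)

taxa_split
```

This only helps when the extra delimiters fall in the last rank. If a delimiter can also appear inside an earlier rank, a fixed split like this will misassign the pieces, and you would need a rule specific to your database's naming scheme.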
