New plugin for cleaning taxa labels

Hi guys, hope you are all doing well!

Just a quick post to highlight a small plugin I've been working on — q2-taxa-clean — which addresses a frustration I kept running into after taxonomy classification: uninformative terminal labels cluttering feature tables and figures.

After running qiime feature-classifier classify-sklearn, it's common to end up with strings like:

d__Bacteria; p__Firmicutes; c__Clostridia; o__Oscillospirales; f__Ruminococcaceae; g__uncultured; s__uncultured_bacterium

The existing tools don't quite cover this: qiime taxa collapse flattens everything to a single fixed level and merges distinct taxa in the process, while rescript edit-taxonomy requires manual find-and-replace, one pattern at a time.

q2-taxa-clean walks each feature individually from the terminal level back toward the root, finds the deepest informative name, and returns the full path truncated at that level:

d__Bacteria; p__Firmicutes; c__Clostridia; o__Oscillospirales; f__Ruminococcaceae
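
For anyone curious, the walk described above can be sketched roughly like this in Python. This is a minimal illustration, not the plugin's actual code: the UNINFORMATIVE list, the function names, and the "Unassigned" fallback are all assumptions made for the example.

```python
# Hypothetical sketch of the "walk back from the terminal level" idea.
# The term list below is an assumption, not the plugin's real list.
UNINFORMATIVE = ("uncultured", "unidentified", "metagenome", "unknown")


def is_informative(level: str) -> bool:
    """Return True if a label such as 'f__Ruminococcaceae' carries a real name."""
    # Drop the rank prefix ('g__') if there is one, then normalise case.
    name = level.split("__", 1)[-1].strip().lower()
    if not name:
        # A bare prefix such as 'g__' has no name at all.
        return False
    return not any(name.startswith(term) for term in UNINFORMATIVE)


def truncate_taxonomy(taxonomy: str) -> str:
    """Truncate a taxonomy string at its deepest informative level."""
    levels = [lvl.strip() for lvl in taxonomy.split(";") if lvl.strip()]
    # Walk from the terminal level back toward the root.
    while levels and not is_informative(levels[-1]):
        levels.pop()
    # Assumed fallback when nothing informative is left.
    return "; ".join(levels) if levels else "Unassigned"


example = (
    "d__Bacteria; p__Firmicutes; c__Clostridia; o__Oscillospirales; "
    "f__Ruminococcaceae; g__uncultured; s__uncultured_bacterium"
)
print(truncate_taxonomy(example))
```

Because each feature is handled on its own, two features can end up truncated at different ranks, which is the difference from collapsing everything to one fixed level.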

Output is valid FeatureData[Taxonomy], so it drops straight into qiime taxa barplot, qiime taxa collapse, or any downstream tool that expects standard taxonomy strings. There's also a --p-flat-labels option that returns single readable names (Ruminococcaceae, Lactobacillus reuteri) for figure legends and axis labels.
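
As a rough picture of what a flat label looks like, here is a small sketch. flat_label is a hypothetical helper rather than the plugin's API, and it assumes the string has already been truncated to its deepest informative level and that species labels carry the genus name (for example s__Lactobacillus_reuteri).

```python
def flat_label(taxonomy: str) -> str:
    """Return the deepest level of a cleaned taxonomy string as a plain name."""
    levels = [lvl.strip() for lvl in taxonomy.split(";") if lvl.strip()]
    if not levels:
        # Assumed fallback for an empty string.
        return "Unassigned"
    # Drop the rank prefix ('f__', 's__') and turn underscores into spaces.
    name = levels[-1].split("__", 1)[-1]
    return name.replace("_", " ")


print(flat_label("d__Bacteria; p__Firmicutes; f__Ruminococcaceae"))
print(flat_label("g__Lactobacillus; s__Lactobacillus_reuteri"))
```

Replacing underscores is only safe when they are word separators. Databases where an underscore suffix is part of the name would need different handling.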

Install into your QIIME 2 environment with:

pip install git+https://github.com/ssaundy/q2-taxa-clean.git

I'd really appreciate any feedback, especially cases it struggles to handle or taxonomy strings it gets wrong. Happy to hear thoughts or take example datasets so I can make it more robust!

Thanks :slight_smile:

Scott

Nice use of the type system to fix your complaint! Also glad to see tests :slight_smile:

One thing I might suggest thinking carefully about is whether the suffixes _1/_2/_3 are going to be the right approach. You may discover someone starts talking about Firmicutes_2 as an objective phylum, to the dismay of the imaginary reader in my head.

Also, I think there’s nothing technically wrong with a FeatureData[Taxonomy] that has only a single level (for your legend-only/pretty publication text). It just has only one level. I suspect some of our code doesn’t work with it, but that’s probably more on us than anything.

Thanks for the feedback!

I’ll have a think about the suffix approach. Are there any conventions you know of for disambiguating duplicates? Also, do you know offhand the main tools that would struggle with single-level strings?

Thanks again

Scott

I tend to describe things I can't classify as "unclassified" (often uncl.) or "unspecified" (unsp.) to distinguish between things that I cannot assign in the NB classifier (unclassified) and things that didn't have a name in the original database (unspecified). The second case is less common now than it was 5-10 years ago, but it's a convention I still keep around some of the time.

I'm also a big fan of keeping my level labels (e.g. uncl. f. Ruminococcaceae) because it tells me about my inheritance structure and makes it easier for my readers to disambiguate taxonomic levels.
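
A minimal sketch of how labels in that style could be built, assuming prefixed levels such as f__Ruminococcaceae. The function name and the reason argument are made up for illustration, and deciding whether a feature is unclassified or unspecified is left to the caller.

```python
def placeholder_label(deepest_level: str, reason: str = "uncl.") -> str:
    """Build a label such as 'uncl. f. Ruminococcaceae' from 'f__Ruminococcaceae'.

    reason is 'uncl.' when the classifier could not assign a deeper level,
    or 'unsp.' when the database itself had no name (caller decides which).
    """
    level = deepest_level.strip()
    rank, sep, name = level.partition("__")
    if not sep:
        # No rank prefix found, so keep the name without a rank marker.
        return f"{reason} {level}"
    return f"{reason} {rank}. {name}"


print(placeholder_label("f__Ruminococcaceae"))
print(placeholder_label("f__Ruminococcaceae", reason="unsp."))
```

Keeping the rank letter in the label means two placeholders at different levels never collapse into the same string.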

I'm also not a huge fan of the suffix idea, partially because of how it potentially overlaps with GTDB/Greengenes 2, where the suffixes are meaningful and Firmicutes and Firmicutes_A are phylogenetically distinct groups. I think as long as we have some kind of indicator that you're mixing multiple unclassified things into the same label, we're probably good. Your taxonomic model is imprecise, but all our taxonomic models are imperfect and imprecise :woman_shrugging:. If people want something specific, the sequence (ASV) level is a great way to go.

This is fantastic, Justine, thanks for your advice. I’ll definitely look into using your uncl. f. Ruminococcaceae convention in place of the suffixes to avoid collision risk. I’ll get back to you here when I’ve got something solid!

All the best,

Scott

I think @jwdebelius’s approach is much better than anything I would have come up with.

I was pretty sure we turned those into lists via a split on ;, so in theory, it should always “just work”. But I assumed you had run into something specific that led you to disavow the FeatureData[Taxonomy] type as a bit of a white lie for rendering purposes. I just wanted to say that I think you’re fine with that type in either mode.
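
To illustrate the split-on-; point, here is a tiny sketch. It is plain Python, not the actual QIIME 2 code, and split_levels is just an illustrative name.

```python
def split_levels(taxonomy: str) -> list[str]:
    """Split a taxonomy string on ';' and trim whitespace around each level."""
    return [level.strip() for level in taxonomy.split(";")]


# A single-level string still yields a list, just with one element.
print(split_levels("Ruminococcaceae"))
print(split_levels("d__Bacteria; p__Firmicutes; f__Ruminococcaceae"))
```

Code that only iterates over the resulting list should not care how many levels there are. Code that indexes a fixed rank position is where a single-level string could cause trouble.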
