Taxonomic Clean Up

jwdebelius · May 5, 2020, 5:30pm

Hi Friends,

I'm wondering how, if at all, people clean up taxonomic annotation from databases before publication? If not, why not? If you do, what do you clean and how?

As a more concrete example, if I decided to develop a Greengenes or Silva-like label for Batman characters because I am a nerd, how would you clean up these taxa?

k__DCU; p__Superhero; c__Gotham; o__Batfamily; f__Batman; g__; s__
k__DCU; p__Superhero; c__Gotham; o__[Batfamily]; f__Red_Hood;  g__Todd; s__Jason

D_0__DCU;D_1__Superhero;D_2__Gotham;D_3__Batfamily;D_4__Batman;D_5__ambigious_taxa; D_6__ambigious_taxa
D_0__DCU;D_1__Superhero;D_2__Gotham;D_3__Batfamily;D_4__Batman;D_5__Wayne; D_6__uncultured_superhero

Are there ways you'd like to be able to do taxonomic cleaning?

Thanks!
Justine

timanix · May 6, 2020, 5:50am

Hi!

Sorry, but it is genius

I clean the taxonomy a little to remove some symbols, like '' from your example, and I usually label taxa like 'Wayne_uncultured_superhero' from your last example. But I keep the original taxonomy and create additional labels for the ASV table based on the last taxonomic unit available.
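A minimal Python sketch of that kind of labeling, assuming semicolon-delimited taxonomy strings like the examples above. The function name `last_rank_label`, the prefix pattern, and the list of placeholder words are all hypothetical choices for illustration, not part of any database or plugin.

```python
import re

# Assumed prefixes: Greengenes-style "k__" and Silva-style "D_0__".
PREFIX = re.compile(r'^(?:[a-z]__|D_\d+__)')
# Assumed placeholder markers; a real database will likely need more.
PLACEHOLDER = re.compile(r'^(?:uncultured|ambig|unspecified)', re.IGNORECASE)


def last_rank_label(taxonomy):
    """Label a taxon by its last named rank, keeping a trailing placeholder."""
    ranks = [PREFIX.sub('', rank.strip()) for rank in taxonomy.split(';')]
    ranks = [rank for rank in ranks if rank]  # drop empty ranks such as "g__"
    named = [rank for rank in ranks if not PLACEHOLDER.match(rank)]
    if not named:
        return 'unclassified'
    label = named[-1]
    # If the string ends in a placeholder, append it to the last named rank.
    if PLACEHOLDER.match(ranks[-1]):
        label = f'{label}_{ranks[-1]}'
    return label


silva_like = ('D_0__DCU;D_1__Superhero;D_2__Gotham;D_3__Batfamily;'
              'D_4__Batman;D_5__Wayne; D_6__uncultured_superhero')
print(last_rank_label(silva_like))
```

The original taxonomy string is left untouched; the label is an extra column that can sit next to it in the table.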

jwdebelius · May 6, 2020, 3:41pm

Thanks @timanix! Part of the reason I ask is that I'm trying to write a script to clean automagically, so I want to figure out what the optimal cleaning is for many people.

Pop culture themed test code and examples are a critical part of my development cycle.

Does it make a difference if I mention that the brackets mean contested? So, like, [Batfamily] is actually "contested Batfamily". (It took me years to learn this.)
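If the square brackets do mark a contested placement, one option is to strip them from the name and record the flag separately, so the information is not lost during cleaning. A rough sketch, assuming the bracket wraps the whole name after an optional rank prefix as in the `o__[Batfamily]` example; `split_contested` is a made-up helper name.

```python
import re

# Assumed format: optional rank prefix ("o__" or "D_3__") followed by "[Name]".
BRACKETED = re.compile(r'^(?P<prefix>[a-z]__|D_\d+__)?\[(?P<name>[^\]]+)\]$')


def split_contested(rank):
    """Return (rank_without_brackets, is_contested) for one rank string."""
    rank = rank.strip()
    match = BRACKETED.match(rank)
    if match:
        return (match.group('prefix') or '') + match.group('name'), True
    return rank, False


print(split_contested('o__[Batfamily]'))
print(split_contested('f__Red_Hood'))
```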

timanix · May 6, 2020, 4:11pm

That is why I always keep the original taxonomy. I need to clean my labels because when I feed labeled ASVs to a tree-building plugin, it complains about numerous 'wrong' symbols. Now I get what you want to do. In the Silva database, there is a very annoying thing with some symbols: sometimes you can't find or replace them with a script. The solution was to first run

.replace('(','\\(').replace(')','\\)').replace('[','\\[').replace(']','\\]').replace('+','\\+')

and then I could process them normally. I still don't know the reason for it.
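One possible explanation, assuming the find/replace step was regex-based (for example `re.sub`, or pandas `str.replace` with `regex=True`): `(`, `)`, `[`, `]`, and `+` are regex metacharacters, so an unescaped taxon name is read as a pattern rather than as literal text. Python's `re.escape` applies the same kind of backslash escaping as the chained `.replace` calls. A small sketch with a made-up taxon string:

```python
import re

taxon = 'o__[Batfamily]'  # made-up example containing regex metacharacters
text = 'c__Gotham; o__[Batfamily]; f__Red_Hood'

# Unescaped, "[Batfamily]" is treated as a character class,
# so the literal bracketed name is not found.
unescaped_hit = re.search(taxon, text)

# re.escape backslash-escapes the metacharacters so the name matches literally.
escaped_hit = re.search(re.escape(taxon), text)

cleaned = re.sub(re.escape(taxon), 'o__Batfamily', text)
print(unescaped_hit, escaped_hit is not None, cleaned)
```

Plain `str.replace` on a Python string is always literal and should not need the escaping, so this only applies if a regex engine was involved somewhere in the pipeline.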

colinbrislawn · May 6, 2020, 6:03pm

Hi Justine,

Unless a PI or coauthor asks to remove them, I usually keep them in. While they could be considered a database artefact, I don't mind having unambiguous ranking labels in all my taxa strings.

Because I make most graphs with ggplot2 and Phyloseq, here is how I would remove the prefixes:

glom.top.melt <- psmelt(glom.top)
glom.top.melt$Phylum <- gsub('D_1__', '', glom.top.melt$Rank2)

Colin
P.S. More R code is in this repo.

jwdebelius · May 7, 2020, 5:46pm

Hi @colinbrislawn and @timanix,

Thanks! I'm struggling with whether or not I should write a standardized database tidying script for a plugin I'm working on. And, like, whether I should add an "uncultured" or "unspecified" or "ambiguous" label to the inherited label. So, it sounds like people (you) are already doing it, but that it's not systematic. And, while it's not that much more work (and lets me procrastinate on something I don't want to write), if it's not going to be useful, it's probably not worth doing. I've got my own ever-growing collection of taxonomy cleaning notebooks stashed across three or four file systems, but given how surprisingly easy it has been to write a qiime2 plugin, it seemed worth considering here.

Thank you for your input!

Best,
Justine