Database-specific taxonomy file formats

gregcaporaso · May 8, 2018, 1:47am

One thing that has been in the back of my mind for a long time now is that we need to have a way to allow users to hide the "taxonomic level indicators" (e.g., the leading k__ in Greengenes taxonomy files) from their taxonomy strings in visualizations, especially ones that end up in publications. That's really hard to build into a visualization though, since a visualization never knows if or what level indicators are present.

I was thinking it might make sense to define some new formats that were database-specific, and these could have transformers that strip that information on import. For example, we could have a Greengenes-13-8-Taxonomy which could be used to import a taxonomy file in the Greengenes 13_8 format, and a corresponding transformer would strip the k__, p__, ..., from the taxonomy labels on import. We could also have a Silva-specific format which would handle stripping of the Silva level indicators.
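
As a rough illustration of what the stripping step in such a transformer might do, here is a minimal sketch in plain Python. It is not QIIME 2 code: the function name `strip_level_indicators` is hypothetical, and the prefix patterns are assumptions (a single lowercase letter plus two underscores for Greengenes-style labels, and `D_<n>__` for Silva-style labels).

```python
import re

# Assumed prefix shapes: "k__", "p__", ... (Greengenes-style)
# and "D_0__", "D_1__", ... (Silva-style). Both are assumptions.
_LEVEL_INDICATOR = re.compile(r"^(?:[a-z]|D_\d+)__")


def strip_level_indicators(taxonomy, sep=";"):
    """Remove the leading level indicator from each level of a taxonomy string."""
    levels = [level.strip() for level in taxonomy.split(sep)]
    return "; ".join(_LEVEL_INDICATOR.sub("", level) for level in levels)


print(strip_level_indicators("k__Bacteria; p__Firmicutes; c__Bacilli"))
print(strip_level_indicators("D_0__Bacteria;D_1__Firmicutes"))
```

Doing this at import time means the visualization never has to guess which indicators are present, at the cost of losing the rank information unless it is stored elsewhere.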

Thoughts on this approach?

antgonza · May 8, 2018, 3:50am

I actually like the taxonomic levels as prefixes because it's hard for me to remember which level each of the names belongs to. For example: is Lactobacillales an Order or a Family, and what about Lactobacillaceae? Even numbers can be confusing: is level 5 Family or Order? However, I like the idea of stripping as an option at visualization creation, though perhaps it would be better if the visualization knew how to strip, so the user could simply click a button and strip.

aphanotus · May 8, 2018, 3:00pm

I like the idea of an option to remove the level indicators from databases with 7-level taxonomies. However, it would be useful for users to retain the option of using rank-free taxonomy.

SoilRotifer · May 8, 2018, 3:14pm

I agree that there should be an option to strip away the level indicators.

While we are discussing taxonomy strings / rank indicators: I have some ideas on how to coerce the SILVA taxonomy strings into something similar to a Greengenes-like format, i.e. with the actual k__, p__,... ranks. This, I think, can be done by making use of the taxonomy-map files here.

I discussed some of this with @wasade and Pelin last week. It might be a good idea for a plugin, which I'd be happy to start working on, with a little help. Thoughts?
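
To make the idea a bit more concrete, here is a minimal sketch of adding Greengenes-like rank prefixes to an unranked taxonomy string. Everything in it is an assumption for illustration: `rank_of` is a hypothetical lookup from a cumulative taxon path to a rank name (something that might be built from a taxonomy-map file, whose actual layout is not assumed here), and mapping "domain" to `k__` is just one possible choice.

```python
# Hypothetical mapping from rank name to a Greengenes-like prefix.
RANK_PREFIX = {
    "domain": "k__", "phylum": "p__", "class": "c__",
    "order": "o__", "family": "f__", "genus": "g__",
}


def add_rank_prefixes(taxonomy, rank_of, sep=";"):
    """Prefix each level with its rank indicator.

    Levels whose rank is unknown or has no prefix (e.g. an
    intermediate rank) are dropped in this sketch.
    """
    names = [n.strip() for n in taxonomy.split(sep) if n.strip()]
    labelled = []
    for i, name in enumerate(names):
        path = sep.join(names[: i + 1])  # cumulative path up to this level
        prefix = RANK_PREFIX.get(rank_of.get(path))
        if prefix is not None:
            labelled.append(prefix + name)
    return "; ".join(labelled)


# Hypothetical lookup table, written by hand for the example.
rank_of = {
    "Bacteria": "domain",
    "Bacteria;Firmicutes": "phylum",
    "Bacteria;Firmicutes;Bacilli": "class",
}
print(add_rank_prefixes("Bacteria;Firmicutes;Bacilli", rank_of))
```

Keying the lookup on the full path rather than the bare name avoids ambiguity when the same name appears at more than one place in the tree.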

marchywka · May 8, 2018, 3:20pm

From my own efforts to reconcile data from various sources, taxonomy is a huge problem. To save memory and avoid confusion, passing around tokens (integers referencing a string or a node number in a tax tree) seems like a good approach, as long as you don't mix up the tables. I found several transforms helpful (stripping the leading x__, lowercasing, and whitespace changes), but getting conforming names is still of limited value when taxonomies vary a lot. When I got my first set of data with a leading x__ I stripped it, but I found later that it may be helpful, and when I'm reading papers it generally does not get in the way of anything. fwiw.
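
A minimal sketch of the kind of normalization and integer tokens described above might look like the following. The names `normalize_name` and `NameTable` are hypothetical, and the `x__` prefix pattern (one letter plus two underscores) is an assumption.

```python
import re


def normalize_name(name):
    """Lowercase, strip a leading x__ prefix, and collapse whitespace."""
    name = name.strip().lower()
    name = re.sub(r"^[a-z]__", "", name)  # assumed prefix shape
    return re.sub(r"\s+", " ", name).strip()


class NameTable:
    """Maps normalized names to small integer tokens and back."""

    def __init__(self):
        self._ids = {}
        self._names = []

    def token(self, name):
        key = normalize_name(name)
        if key not in self._ids:
            self._ids[key] = len(self._names)
            self._names.append(key)
        return self._ids[key]

    def name(self, token):
        return self._names[token]


table = NameTable()
a = table.token("g__Lactobacillus")
b = table.token("  lactobacillus ")
print(a == b, table.name(a))
```

Tokens are only meaningful relative to the table that issued them, which is the "don't mix up the tables" caveat.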