Why do all taxonomic levels on the bar plot have the letter 'D'?

If I am not mistaken, when I used Greengenes, each label started with the first letter of its taxonomic level, e.g. 'k' for kingdom, 'f' for family, 'o' for order, and so on.
Is it possible to fix it?

Any fix to this issue?

Welcome to the forum, @Alex!

Sounds like you are using the SILVA database, or another database that uses SILVA naming conventions. SILVA (and others) use the “D” prefix to indicate the taxonomic rank of each taxonomic label, e.g., to identify whether a label is the phylum or genus, etc.

In any event, the taxonomic names that you see are 100% coming from the reference database — QIIME 2 and its plugins do not modify these names in any way.

If you want to remove these labels, you will need to do so yourself. You can do this in the reference taxonomy itself, before importing to QIIME 2, or you can edit the taxonomy classifications that q2-feature-classifier provides by exporting, editing (e.g., programmatically with Python, R, or even find/replace all in a text editor!), and re-importing.
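For anyone taking the programmatic route, here is a minimal Python sketch of the editing step. It assumes the labels follow the `D_<number>__` pattern described in this thread and that ranks are separated by semicolons; `strip_depth_prefixes` is a name made up for this example, not part of QIIME 2.

```python
import re

# Matches SILVA-style depth prefixes such as "D_0__" or "D_12__".
# Assumption: this pattern never occurs inside a real taxon name.
DEPTH_PREFIX = re.compile(r"D_\d+__")


def strip_depth_prefixes(taxon: str) -> str:
    """Remove every D_<n>__ prefix from a semicolon-delimited taxonomy string."""
    return DEPTH_PREFIX.sub("", taxon)


example = "D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli"
print(strip_depth_prefixes(example))  # Bacteria;Firmicutes;Bacilli
```

You would apply the function to the taxonomy column of the exported file, write the result back out, and re-import it.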

Good luck!

I can likely provide some insight here, as I am one of the contributors who helped format the SILVA database for QIIME. The D_X__ convention was chosen to be as unique and “safe” a text string as possible, considering the many bizarre taxonomy text annotations within the SILVA reference database. That is, it was meant as a quick fix to be able to search and parse these taxonomy strings.

The ‘D’ was a way of annotating the “taxonomic Depth”. At the time some of the code was written, we realized that the taxonomy provided for eukaryotes had neither a consistent, fixed depth of ranks nor a rank consistently associated with a given depth. That is, some taxa have 13 taxonomic ranks, others 7, etc. So, all of the taxonomy strings were padded out to ~14-15 ranks, so that it’d be easier to coerce these strings into tools like RDP classifier or scikit-learn. That is, we initially had to satisfy the requirement that all taxonomy strings contain the same number of ranks. :man_factory_worker::woman_factory_worker:
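To illustrate the padding idea, here is a rough Python sketch. It is not the code that was used to build the SILVA files; the placeholder format (an empty `D_<n>__` label) and the default depth of 15 are assumptions made for this example, as is the function name `pad_taxonomy`.

```python
def pad_taxonomy(taxon: str, depth: int = 15) -> str:
    """Pad a semicolon-delimited taxonomy string out to a fixed number of ranks."""
    ranks = [rank for rank in taxon.split(";") if rank]
    # Assumed placeholder: an empty label that carries only its depth index.
    ranks += [f"D_{i}__" for i in range(len(ranks), depth)]
    return ";".join(ranks)


print(pad_taxonomy("D_0__Eukaryota;D_1__Opisthokonta", depth=4))
# D_0__Eukaryota;D_1__Opisthokonta;D_2__;D_3__
```

Strings that already have `depth` or more ranks are returned unchanged, so nothing is truncated.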

As an additional example, level D_4__ for one eukaryote may refer to a “Family”, whereas that same level for another may refer to a “Super Family”, etc. Thus, we avoided using the standard rank annotation style of Greengenes. Hopefully, this makes some sort of sense. :man_shrugging:

However, if you are using the SILVA 7-rank taxonomy files, and are only concerned with Archaea & Bacteria, then you can relabel D_0__ through D_6__ as Domain / Kingdom through Species without much trouble. Again, the issue had more to do with the wonkiness of the Eukaryote taxonomy.
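A minimal sketch of that relabeling in Python, assuming 7-rank Archaea/Bacteria strings only. The Greengenes-style letters in `RANK_LETTERS` and the function name `relabel` are choices made for this example; swap in `d` for `k` if you prefer "domain".

```python
import re

# Assumed mapping of depth index 0..6 to Greengenes-style rank letters
# (kingdom, phylum, class, order, family, genus, species).
RANK_LETTERS = ["k", "p", "c", "o", "f", "g", "s"]


def relabel(taxon: str) -> str:
    """Replace D_<n>__ prefixes with rank letters; deeper levels are left as-is."""

    def swap(match: re.Match) -> str:
        depth = int(match.group(1))
        if depth < len(RANK_LETTERS):
            return f"{RANK_LETTERS[depth]}__"
        return match.group(0)  # no standard rank for this depth

    return re.sub(r"D_(\d+)__", swap, taxon)


print(relabel("D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli"))
# k__Bacteria;p__Firmicutes;c__Bacilli
```

Do not run this on the all-levels eukaryote strings: as explained above, a given depth there does not map to a fixed rank.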

I have since found a way, I think, to obtain “Greengenes-like” taxonomy strings, and I’ve uploaded some quite crude prototype code here. Note: this has not been thoroughly tested and vetted yet! :volcano:

In brief, I realized that the SILVA folks maintain a taxonomy tree that can be used to easily map and extract only those ranks we’d like to retain, bypassing all of the intermediate ranks such as “Sub-order”, etc.
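The idea can be sketched in Python as follows. This is not the prototype code mentioned above: the input format (a list of `(name, rank)` pairs from root to tip), the rank names in `KEEP`, and the taxon names in the example are all hypothetical stand-ins for whatever the SILVA tree actually provides.

```python
# Ranks to retain, in output order (an assumption for this sketch).
KEEP = ("domain", "phylum", "class", "order", "family", "genus")


def retain_ranks(lineage, keep=KEEP):
    """Build a Greengenes-like string from (name, rank) pairs, skipping other ranks."""
    by_rank = {rank: name for name, rank in lineage}
    # Prefix each retained rank with its first letter; missing ranks stay empty.
    return ";".join(f"{rank[0]}__{by_rank.get(rank, '')}" for rank in keep)


# Hypothetical lineage with intermediate ranks that should be dropped.
lineage = [
    ("Bacteria", "domain"),
    ("ExamplePhylum", "phylum"),
    ("ExampleClass", "class"),
    ("ExampleSubclass", "subclass"),
    ("ExampleOrder", "order"),
    ("ExampleSuborder", "suborder"),
    ("ExampleFamily", "family"),
    ("ExampleGenus", "genus"),
]
print(retain_ranks(lineage))
# d__Bacteria;p__ExamplePhylum;c__ExampleClass;o__ExampleOrder;f__ExampleFamily;g__ExampleGenus
```

Because the lookup is by rank name rather than by position, every output string has the same number of ranks regardless of how deep the original lineage was.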

We hope to leverage this approach to re-annotate the SILVA taxonomy strings in the future; we are still discussing and working out how to do this. But if anyone would like to contribute to updating and/or testing an updated SILVA database (using this potential solution), please let us know.

There is your history lesson for the day. :man_teacher: :bulb:

:taco:

Thanks @Mike

It was indeed a good history lesson :slight_smile:

Thanks, @Nicholas_Bokulich. Yes, you are right, I am using the SILVA database for taxonomic assignment. Initially, I was concerned that this label might cause problems when performing analysis in R.
It seems it is not a problem, since the taxonomic ranks are also distinguished by ‘numbers’.

As you suggested, modifying the naming is another solution. Thanks for your input.

Just wanted to point some readers of this thread here:

-Mike

Thank you for all that detail, Mike!

I have a question about all the ‘unknown/other’ names at lower taxonomic depths:

What do these mean? Can we treat these all as a single, mysterious group? :mage:

uncultured bacterium
uncultured
            <- just blank
uncultured organism
metagenome
uncultured rumen bacterium
unidentified

Colin

P.S. Congrats to the team for the 138 Silva release! :bouquet:

Hi @colinbrislawn, that is a good question! I am not associated with the SILVA folks, nor am I involved with their maintenance of the taxonomy. But I typically treat these ambiguous identifiers as a single group, e.g. "unidentified" or "ambiguous".
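If you want to do that programmatically, something along these lines would work. It is only a sketch: the set of labels is taken from the list in the question above (it is not an exhaustive list of what SILVA contains), and `normalize_label` and its `placeholder` argument are names invented for this example.

```python
# Labels from the question above; extend this set for your own data.
AMBIGUOUS = {
    "",
    "uncultured",
    "uncultured bacterium",
    "uncultured organism",
    "uncultured rumen bacterium",
    "metagenome",
    "unidentified",
}


def normalize_label(label: str, placeholder: str = "unidentified") -> str:
    """Collapse blank or ambiguous taxon names into a single placeholder."""
    name = label.strip()
    return placeholder if name.lower() in AMBIGUOUS else name


print(normalize_label("uncultured rumen bacterium"))  # unidentified
print(normalize_label("Bacillus"))                    # Bacillus
```

Strip any rank prefix from the label before calling the function, since the comparison is made against the bare name.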

I'll try to parse the new SILVA v138 DB that recently came out, and I'll post a link to the files when I get a chance. They will have 6- or 7-rank taxonomy labels, as I've done for the previous 132 DB files. Those are available here, until I can find a more permanent home for them:

-Mike
