I can likely provide some insight here, as I am one of the contributors who helped format the SILVA database for QIIME. The D_X__ convention was chosen to be as unique and “safe” a text string as possible, considering the many bizarre taxonomy text annotations within the SILVA reference database. That is, it was meant as a quick fix that let us search and parse these taxonomy strings.
The ‘D’ was a way of annotating the “taxonomic Depth”. When some of the code was written, we realized that the taxonomy provided for eukaryotes had neither a consistent, fixed number of ranks nor a rank consistently associated with a given depth. That is, some taxa have 13 taxonomic ranks, others 7, etc. So, all of the taxonomy strings were padded out to ~14-15 ranks, making it easier to coerce these strings into tools like the RDP classifier or scikit-learn. That is, we initially had to satisfy the requirement that all taxonomy strings contain the same number of ranks.
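To make the padding idea concrete, here is a minimal sketch in Python. It is not the code that was used to build the QIIME-formatted SILVA files; the function name, the default depth of 15, and the `unassigned` filler text are all assumptions made for illustration.

```python
def to_depth_prefixed(names, depth=15, filler="unassigned"):
    """Prefix each taxon name with its depth index and pad to a fixed depth.

    `depth` and `filler` are hypothetical defaults, not the values used
    in any actual SILVA release for QIIME.
    """
    if len(names) > depth:
        raise ValueError(f"taxonomy has {len(names)} ranks, more than depth={depth}")
    # Pad short lineages so every string ends up with the same number of ranks.
    padded = list(names) + [filler] * (depth - len(names))
    # The D_<n>__ prefix records only the position, not a named rank.
    return ";".join(f"D_{i}__{name}" for i, name in enumerate(padded))


print(to_depth_prefixed(["Bacteria", "Firmicutes"], depth=4))
```

The point of the sketch is that the prefix encodes position only, which is why it stays valid even when the rank at that position differs between lineages.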
As an additional example, level D_4__ for one eukaryote may refer to a “Family”, whereas that same level for another may refer to a “Super Family”, etc. Thus, we avoided using the standard rank annotation style of Greengenes. Hopefully, this makes some sort of sense.
However, if you are using the SILVA 7-rank taxonomy files and are only concerned with Archaea & Bacteria, then you can relabel D_0__ through D_6__ as Domain / Kingdom through Species without much trouble. Again, the issue had more to do with the wonkiness of the Eukaryote taxonomy.
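For the 7-rank case, the relabeling could look something like the sketch below. It assumes semicolon-separated strings, uses the Greengenes-style prefixes `k__` through `s__` as the target labels, and refuses anything that is not Archaea or Bacteria. The function name and the exact output spacing are my own choices, not part of any QIIME tool.

```python
import re

# Greengenes-style prefixes for depths 0..6 (Domain/Kingdom through Species).
GG_PREFIXES = ["k__", "p__", "c__", "o__", "f__", "g__", "s__"]
DEPTH_PREFIX = re.compile(r"^D_(\d+)__")


def relabel_seven_rank(taxonomy):
    """Convert 'D_0__X;D_1__Y;...' into 'k__X; p__Y; ...' (hypothetical helper)."""
    parts = [p.strip() for p in taxonomy.split(";") if p.strip()]
    relabeled = []
    for part in parts:
        match = DEPTH_PREFIX.match(part)
        if match is None:
            raise ValueError(f"missing D_<n>__ prefix: {part!r}")
        depth = int(match.group(1))
        if depth >= len(GG_PREFIXES):
            raise ValueError(f"depth {depth} is outside the 7-rank scheme")
        name = part[match.end():]
        # Depth only maps cleanly onto named ranks for Archaea and Bacteria.
        if depth == 0 and name not in ("Archaea", "Bacteria"):
            raise ValueError(f"refusing to relabel domain {name!r}")
        relabeled.append(GG_PREFIXES[depth] + name)
    return "; ".join(relabeled)


print(relabel_seven_rank("D_0__Bacteria;D_1__Proteobacteria"))
```

Raising an error on eukaryotes is deliberate: for those lineages a given depth does not correspond to a single named rank, so a silent relabel would be misleading.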
I have since found a way, I think, to obtain “Greengenes-like” taxonomy strings, and I’ve uploaded some quite crude prototype code here. Note: this has not been thoroughly tested and vetted yet!
In brief, I realized that the SILVA folks maintain a taxonomy tree that can be used to easily map and extract only those ranks we’d like to retain, bypassing all of the intermediate ranks such as “Sub-order”, etc.
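Roughly, the idea can be sketched as follows. This is not the prototype code mentioned above. It assumes you have already built a lookup (`rank_of`, a hypothetical name) from each cumulative taxonomy path to its rank name using the SILVA taxonomy tree; the path format with a trailing semicolon, the list of ranks to keep, and the placeholder taxon names are all assumptions for illustration.

```python
# Ranks to retain; everything else (e.g. "suborder") is skipped.
KEEP = ("domain", "phylum", "class", "order", "family", "genus")


def extract_ranks(path, rank_of, keep=KEEP):
    """Return the taxon name at each retained rank, or '' when it is absent.

    `rank_of` maps a cumulative path such as 'A;B;' to a rank name.
    Its layout is an assumption, not a documented SILVA format.
    """
    names = [n for n in path.split(";") if n]
    found = {}
    for i, name in enumerate(names):
        # Look up the rank of the lineage up to and including this name.
        cumulative = ";".join(names[: i + 1]) + ";"
        rank = rank_of.get(cumulative)
        if rank in keep:
            found[rank] = name
    return [found.get(rank, "") for rank in keep]


# Toy lookup with placeholder taxon names.
toy_rank_of = {
    "A;": "domain",
    "A;B;": "superphylum",
    "A;B;C;": "phylum",
    "A;B;C;D;": "class",
    "A;B;C;D;E;": "suborder",
}
print(extract_ranks("A;B;C;D;E;", toy_rank_of))
```

Because each name is placed by its rank rather than by its position in the string, intermediate ranks drop out and missing ranks are left empty instead of shifting everything along.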
We hope to leverage this approach to re-annotate the SILVA taxonomy strings in the future; we are still discussing and working out an approach for this. If anyone would like to contribute to updating and/or testing an updated SILVA database (using this potential solution), please let us know.
There is your history lesson for the day.