Hi @prince,
To clarify, we do not curate the taxonomy; we simply parse the files as provided by SILVA. As for the upper-level taxonomy being propagated downward, this is intentional. Not all reference sequences have taxonomic annotations for every rank. When a particular rank is missing (in this case the family level), the name from the nearest annotated rank above is propagated down into the empty slot. This continues until a lower rank with its own annotation is found; that name is then the one propagated downward, and so on, down to the genus or species level. For example:
d__Bacteria; p__Firmicutes; c__Clostridia; o__Peptostreptococcales-Tissierellales; f__Peptostreptococcales-Tissierellales; g__Peptoniphilus; s__Peptoniphilaceae_bacterium
Note that the o__ rank is propagated down to f__, but we found rank information for g__ and s__.
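To make the propagation rule concrete, here is a minimal Python sketch. It is not the actual parsing code; the function name `propagate_ranks`, the `RANK_PREFIXES` list, and the use of `None` to mark a missing rank are all assumptions made for illustration.

```python
# Illustrative sketch only; the names below are hypothetical, not a real API.
RANK_PREFIXES = ["d__", "p__", "c__", "o__", "f__", "g__", "s__"]


def propagate_ranks(names):
    """Fill each missing rank with the nearest annotated name above it.

    `names` holds one entry per rank, ordered from domain to species.
    None (or an empty string) marks a rank with no annotation.
    """
    filled = []
    last_seen = None
    for prefix, name in zip(RANK_PREFIXES, names):
        if name:
            # This rank has its own annotation, so it becomes the
            # name that is carried down to any empty ranks below it.
            last_seen = name
        # The rank prefix always reflects the slot, not the origin of the name.
        filled.append(prefix + (last_seen or ""))
    return "; ".join(filled)


example = [
    "Bacteria",
    "Firmicutes",
    "Clostridia",
    "Peptostreptococcales-Tissierellales",
    None,  # family is missing in this example
    "Peptoniphilus",
    "Peptoniphilaceae_bacterium",
]
print(propagate_ranks(example))
```

Run on the example list, this reproduces the lineage shown above: the order name fills the f__ slot, while g__ and s__ keep their own annotations.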
This propagation of rank information is a convention followed by other research groups and tool developers too. It makes it easier to meet the requirements of various taxonomy classifiers (e.g. some tools require that all reference sequences have the same number of ranks).
Keeping the original o__ prefix on the propagated name might be technically correct, but it would make parsing the taxonomy onerous: downstream analyses may fail because there would be two o__ levels. This is why we prepend each rank with o__, f__, etc., so that every slot carries its own unique rank-level label.
Basically, consider these annotations as saying: "We are using the name Clostridia_UCG-014 to fill the f__ slot, and again for the g__ slot." This is how these taxonomy annotations should be interpreted. For example, there are many cases of s__gut_metagenome, which is not a legitimate species name in its own right. So, the annotation gut_metagenome is being used to fill in the s__ slot.
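If you want to see which slots in a lineage were filled this way, one rough heuristic is to flag any rank whose name repeats the name in the rank directly above it. The sketch below assumes the semicolon-delimited, prefixed format used in this post; the function name `filled_slots` is hypothetical, and the example lineage is assembled for illustration from the name mentioned above.

```python
# Illustrative heuristic only; `filled_slots` is a hypothetical helper.
def filled_slots(taxonomy):
    """Return the rank prefixes whose name repeats the rank above."""
    ranks = [r.strip() for r in taxonomy.split(";") if r.strip()]
    flagged = []
    previous_name = None
    for rank in ranks:
        # Split "f__Some_name" into the prefix letter and the name;
        # only the first "__" separates the two.
        prefix, _, name = rank.partition("__")
        if name and name == previous_name:
            flagged.append(prefix + "__")
        previous_name = name
    return flagged


lineage = (
    "d__Bacteria; p__Firmicutes; c__Clostridia; "
    "o__Clostridia_UCG-014; f__Clostridia_UCG-014; g__Clostridia_UCG-014"
)
print(filled_slots(lineage))
```

Note that this only catches repeated names. A placeholder such as s__gut_metagenome is not repeated from a higher rank, so it would not be flagged and would need its own check.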
This is not perfect, but it is what we have to work with. Curating taxonomy is hard work, which is why we greatly appreciate those who do it!
As for the ;__, there are a few answers to this on the forum, but here is one explanation:
-Hope this helps!
-Best wishes.