rescript evaluate-taxonomy terminal labels

Wondering what I've done wrong in feeding a taxonomy.qza file into qiime rescript evaluate-taxonomy to generate the output shown in the table below. The number of unique labels and the entropy calculations seem fine, but I don't understand where I've gone wrong with regard to the terminal labels and unclassified labels categories.

Level Unique_Labels Taxonomic_Entropy Terminal_Labels Proportion_Terminal_Labels Classified_Labels Proportion_Classified_Labels Unclassified_Labels Proportion_Unclassified_Labels
1 1 0 0 0 24561 1 0 0
2 1 0 0 0 24561 1 0 0
3 14 1.492 0 0 24561 1 0 0
4 173 3.87 0 0 24561 1 0 0
5 1089 5.83 0 0 24561 1 0 0
6 6416 7.91 0 0 24561 1 0 0
7 17672 9.36 24561 1 24561 1 0 0

I would have expected unclassified labels at Species (4231), Genus (2515), Family (2172), Order (259), and Class (2), yet these are not being reported.

My initial thought was that the taxonomy file I imported when creating this .qza object used incorrect rank handles. The initial file was structured like this:

Feature ID      Taxon
10013526        tax=k__Animalia;p__Chordata;c__Actinopterygii;o__Perciformes;f__Serranidae;g__Caesioperca;s__Caesioperca rasor
10013530        tax=k__Animalia;p__Chordata;c__Actinopterygii;o__Tetraodontiformes;f__Tetraodontidae;g__Contusus;s__Contusus brevicaudus
10013534        tax=k__Animalia;p__Chordata;c__Actinopterygii;o__Perciformes;f__Cheilodactylidae;g__Cheilodactylus;s__Cheilodactylus variegatus

so I then thought that maybe the tax= prefix to the Taxon field was causing the error. I removed that portion so that the next data set looked like this:

Feature ID      Taxon
10013526        k__Animalia;p__Chordata;c__Actinopterygii;o__Perciformes;f__Serranidae;g__Caesioperca;s__Caesioperca rasor
10013530        k__Animalia;p__Chordata;c__Actinopterygii;o__Tetraodontiformes;f__Tetraodontidae;g__Contusus;s__Contusus brevicaudus
10013534        k__Animalia;p__Chordata;c__Actinopterygii;o__Perciformes;f__Cheilodactylidae;g__Cheilodactylus;s__Cheilodactylus variegatus

and then reran the evaluate-taxonomy function.
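For anyone wanting to reproduce that cleanup step, here is a minimal sketch of stripping the tax= prefix from the Taxon column. It assumes the file is tab-separated with a "Feature ID" / "Taxon" header row; the function name strip_tax_prefix is hypothetical and not part of QIIME 2 or RESCRIPt.

```python
def strip_tax_prefix(lines, prefix="tax="):
    """Remove a leading prefix from the Taxon column of tab-separated lines.

    Assumes two tab-separated columns (Feature ID, Taxon). Lines whose
    second column does not start with the prefix, such as the header row,
    are passed through unchanged.
    """
    cleaned = []
    for line in lines:
        feature_id, sep, taxon = line.rstrip("\n").partition("\t")
        if taxon.startswith(prefix):
            taxon = taxon[len(prefix):]
        cleaned.append(feature_id + sep + taxon)
    return cleaned


example = [
    "Feature ID\tTaxon",
    "10013526\ttax=k__Animalia;p__Chordata;c__Actinopterygii;o__Perciformes;"
    "f__Serranidae;g__Caesioperca;s__Caesioperca rasor",
]
cleaned_example = strip_tax_prefix(example)
```

The cleaned lines can then be written back out and re-imported as before.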

Then I got the exact same result ... :confounded:

Thanks to @SoilRotifer @Nicholas_Bokulich and others for any help you can offer in trying to understand why the evaluate-taxonomy function isn't calculating the empty labels as I'd expect!

Hi @devonorourke,

Can you provide a few examples of some of the taxonomy strings that you expected to see reported?

-Mike

I was thinking that these kinds of taxonomy strings would produce a count for the Unclassified Labels category at various levels, no?

10129600	k__Animalia;p__Chordata;c__Actinopterygii;o__;f__;g__;s__
10129631	k__Animalia;p__Chordata;c__Actinopterygii;o__Perciformes;f__;g__;s__
10265400	k__Animalia;p__Chordata;c__Actinopterygii;o__;f__;g__;s__

If I remember correctly, I think you need to add the following to your command:

--p-rank-handle-regex "^[dkpcofgs]__"

-Mike
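To illustrate why the regex matters, here is a rough Python sketch of the idea. It is not RESCRIPt's actual implementation; it only assumes that a label counts as unclassified when nothing is left after the rank handle is removed. The function name unclassified_per_level is hypothetical. Without stripping, a label like o__ is a non-empty string and so looks classified.

```python
import re
from collections import Counter

# Same pattern as suggested above for --p-rank-handle-regex.
RANK_HANDLE = re.compile(r"^[dkpcofgs]__")


def unclassified_per_level(taxa, rank_handle=RANK_HANDLE):
    """Count labels per level that are empty once the rank handle is removed.

    If rank_handle is None, no stripping is done (an assumption standing in
    for running without the regex parameter).
    """
    counts = Counter()
    for taxon in taxa:
        for level, label in enumerate(taxon.split(";"), start=1):
            label = label.strip()
            if rank_handle is not None:
                label = rank_handle.sub("", label)
            if label == "":
                counts[level] += 1
    return counts


examples = [
    "k__Animalia;p__Chordata;c__Actinopterygii;o__;f__;g__;s__",
    "k__Animalia;p__Chordata;c__Actinopterygii;o__Perciformes;f__;g__;s__",
    "k__Animalia;p__Chordata;c__Actinopterygii;o__;f__;g__;s__",
]

with_regex = unclassified_per_level(examples)
without_regex = unclassified_per_level(examples, rank_handle=None)
```

With the handles stripped, the three example strings give 2 unclassified labels at level 4 and 3 at each of levels 5 through 7; without stripping, every count is 0, which matches the all-zero column in the original table.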

Thanks @SoilRotifer ! The addition of the --p-rank-handle-regex term resolved the issue.
What I'm wondering now is what taxonomy strings need to look like to work by default, without this added parameter?
Happy Thursday morning, by the way :coffee: :bacon: :fried_egg:

I think the idea was to keep things as generalizable as possible. Some may choose not to include prefixes in their labels, or may use entirely different prefix schemas. Since it is hard to work out all the possible combinations, we decided to leave this up to the user.

Basically, we want to make sure that when comparing taxonomies consisting of different label prefixes (e.g. SILVA and GTDB use d__ and GreenGenes uses k__), we remove the prefixes in order to compare them.

:coffee:
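As a small illustration of the prefix-comparison point, the sketch below strips rank handles from two made-up lineage strings, one using d__ and one using k__, before comparing them. The strings and the helper name strip_handles are hypothetical, and this is only a sketch of the idea, not how RESCRIPt does it internally.

```python
import re

RANK_HANDLE = re.compile(r"^[dkpcofgs]__")


def strip_handles(taxon, rank_handle=RANK_HANDLE):
    """Split a lineage on ';' and remove the rank handle from each label."""
    return [rank_handle.sub("", label.strip()) for label in taxon.split(";")]


# Hypothetical lineages in two prefix styles (illustrative only).
d_style = "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria"
k_style = "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria"

raw_match = d_style == k_style
stripped_match = strip_handles(d_style) == strip_handles(k_style)
```

Here the raw strings differ because of the d__/k__ prefixes and spacing, while the stripped label lists are identical, so the two lineages can be compared on their labels alone.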
