After conferring with @Nicholas_Bokulich, it turns out the code is working as written and intended. Below is the result of our discussion, which describes how --p-mode super works:
Dereplication in super mode looks for the majority annotation (from among all collapsed sequences) at each rank independently. However, before taking the majority, it also collapses substrings into superstrings at each rank.
Let's assume we have 3 identical reference sequences with the following annotations:
ID | annotation |
---|---|
GMGMR1817-18 | k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis |
GMGMU2090-20 | k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__;g__;s__ |
GMGMV5203-20 | k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__;g__;s__ |
Using --p-mode super on this sequence set would indeed yield the following (bolding the order annotation that is wrong):
k__Animalia;p__Arthropoda;c__Insecta;**o__Hymenoptera**;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis
Why? Because super mode first bins the annotations at each rank to yield something like:
[k__Animalia, k__Animalia, k__Animalia]
...
[o__Coleoptera, o__Hymenoptera, o__Hymenoptera]
[f__Coccinellidae, f__, f__]
[g__Harmonia, g__, g__]
[s__Harmonia axyridis, s__, s__]
It then finds the superstrings at each rank independently:
[o__Coleoptera, o__Hymenoptera, o__Hymenoptera]
[f__Coccinellidae, f__Coccinellidae, f__Coccinellidae]
[g__Harmonia, g__Harmonia, g__Harmonia]
[s__Harmonia axyridis, s__Harmonia axyridis, s__Harmonia axyridis]
and then finds the majority annotation to yield the incorrect hybrid annotation above. Note that because the majority annotation has a bunch of empty ranks, the final annotation is "polluted" by a single reference annotation that does not have empty ranks at genus and species levels.
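To make the steps above concrete, here is a minimal Python sketch of the behavior as described in this post. It is not RESCRIPt's actual code: the function name `super_mode_sketch` is hypothetical, and it assumes every annotation has the same number of semicolon-separated ranks. Ties are resolved arbitrarily (first seen), which may differ from the real implementation.

```python
from collections import Counter


def super_mode_sketch(taxonomies):
    """Illustrative only: rank-wise superstring collapse, then majority vote."""
    # Bin the annotations by rank: one tuple of labels per rank.
    ranks = zip(*(t.split(";") for t in taxonomies))
    consensus = []
    for labels in ranks:
        # Replace each label with the longest label at this rank that contains it,
        # so "f__" becomes "f__Coccinellidae" but "o__Coleoptera" stays as-is.
        expanded = [
            max((other for other in labels if label in other), key=len)
            for label in labels
        ]
        # Take the most common label at this rank, independently of other ranks.
        consensus.append(Counter(expanded).most_common(1)[0][0])
    return ";".join(consensus)


refs = [
    "k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis",
    "k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__;g__;s__",
    "k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__;g__;s__",
]
result = super_mode_sketch(refs)
print(result)
# k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis
```

Because each rank is resolved on its own, nothing in this sketch checks that the chosen order is consistent with the chosen family, which is how the hybrid annotation arises.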
Intended use of --p-mode super:
super mode was written with the assumption that (a) you do not have misannotations in your database and (b) you do not have empty ranks, or at least that these empty ranks are not the "correct" annotation. It was written to try to "fill in the blanks" when such empty ranks exist, and it will never work if the "desired" annotation is one with empty ranks.
I hope this helps!