RESCRIPt "super" dereplicate incorrect taxonomic behavior

Hi @Will_Rumfelt

After conferring with @Nicholas_Bokulich, it turns out the code is working as written / intended. Below is the result of our discussion, which ultimately describes how --p-mode super works:

Dereplication in super mode looks for the majority annotation (from among all collapsed sequences) at each rank independently. BUT before taking that majority, it also collapses substrings into superstrings at each rank.

Let's assume we have 3 identical reference sequences with the following annotations:

ID annotation
GMGMR1817-18 k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis
GMGMU2090-20 k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__;g__;s__
GMGMV5203-20 k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__;g__;s__

Using --p-mode super on this sequence set would indeed yield the following (note that the order annotation, o__Hymenoptera, is wrong for this lineage):
k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis

Why? Because super mode first bins the annotations at each rank, yielding something like:
[k__Animalia, k__Animalia, k__Animalia]
...
[o__Coleoptera, o__Hymenoptera, o__Hymenoptera]
[f__Coccinellidae, f__, f__]
[g__Harmonia, g__, g__]
[s__Harmonia axyridis, s__, s__]

It then finds the superstrings at each rank independently:
[o__Coleoptera, o__Hymenoptera, o__Hymenoptera]
[f__Coccinellidae, f__Coccinellidae, f__Coccinellidae]
[g__Harmonia, g__Harmonia, g__Harmonia]
[s__Harmonia axyridis, s__Harmonia axyridis, s__Harmonia axyridis]

and then takes the majority label at each rank, which yields the incorrect hybrid annotation above. Note that because the majority annotation has empty ranks from family down to species, the final annotation is "polluted" by the single reference annotation that fills those ranks.
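
To make the steps above concrete, here is a minimal Python sketch of the behavior as described in this post. It is not RESCRIPt's actual code: the function name super_consensus is made up for illustration, and the tie-breaking (first label seen wins) is my own assumption.

```python
from collections import Counter

# The three annotations from the example above.
annotations = [
    "k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis",
    "k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__;g__;s__",
    "k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__;g__;s__",
]


def super_consensus(annotations):
    """Hypothetical helper: mimic the per-rank logic described in this post."""
    # Bin the labels by rank: one tuple per rank, one entry per sequence.
    ranks = zip(*(a.split(";") for a in annotations))
    consensus = []
    for labels in ranks:
        # Replace every label with the longest label at this rank that
        # contains it, so "f__" becomes "f__Coccinellidae".
        expanded = [
            max((other for other in labels if label in other), key=len)
            for label in labels
        ]
        # Majority vote at this rank only; other ranks are never consulted.
        consensus.append(Counter(expanded).most_common(1)[0][0])
    return ";".join(consensus)


result = super_consensus(annotations)
print(result)
```

Because each rank is resolved without looking at any other rank, the order comes from the two Hymenoptera records while family, genus, and species come from the single Coleoptera record.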

Intended use of --p-mode super:
super mode was written with the assumption that (a) you do not have misannotations in your database and (b) you do not have empty ranks, or at least that the empty ranks are not the "correct" annotation. It was written to "fill in the blanks" when such empty ranks exist, and it will never work if the "desired" annotation is one with empty ranks.
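
If you want to check whether a set of identical sequences breaks assumptions (a) or (b) before relying on super mode, something like the following sketch could help. It is purely illustrative and not part of RESCRIPt: the function name flag_risky_ranks is hypothetical, and it assumes every label starts with a three-character prefix such as "o__".

```python
from collections import Counter

annotations = [
    "k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis",
    "k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__;g__;s__",
    "k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__;g__;s__",
]


def flag_risky_ranks(annotations):
    """Hypothetical helper: list rank prefixes where super mode may mislead."""
    flagged = []
    for labels in zip(*(a.split(";") for a in annotations)):
        prefix = labels[0][:3]  # assumed format, e.g. "o__"
        filled = {label for label in labels if label != prefix}
        majority = Counter(labels).most_common(1)[0][0]
        # (a) possible misannotation: two filled labels, neither of which
        # is a substring of the other.
        conflict = any(a not in b and b not in a for a in filled for b in filled)
        # (b) the majority label is empty, but a minority record fills it in.
        empty_majority = majority == prefix and bool(filled)
        if conflict or empty_majority:
            flagged.append(prefix)
    return flagged


risky = flag_risky_ranks(annotations)
print(risky)
```

For the example in this post, the order rank is flagged for conflicting labels, and family, genus, and species are flagged because the majority label is empty.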

I hope this helps!
