Hi @David_Bradshaw,
The dereplicate action was initially set up to handle taxonomies with only the standard 7 ranks (i.e. dpcofgs). That is, if a taxonomy was truncated at a higher level, we'd backfill it with the corresponding empty prefixes, e.g. f__; g__; s__.
For example, this:
KJ763795.1.1805 d__Eukaryota; k__Alveolata; p__Dinoflagellata; c__Dinophyceae; o__Gymnodiniphycidae
would become this:
KJ763795.1.1805 d__Eukaryota; k__Alveolata; p__Dinoflagellata; c__Dinophyceae; o__Gymnodiniphycidae; f__; g__; s__
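To make the backfilling concrete, here is a minimal Python sketch. It is not the actual dereplicate code: the function name `backfill`, the `STANDARD_RANKS` list, and the prefix-matching logic are all assumptions made for illustration. It reproduces the example above, and it also shows why a taxonomy that ends at a rank outside the expected list cannot be backfilled this way.

```python
# Hypothetical sketch, not the real implementation.
# Assumed rank prefixes for the standard 7 ranks (dpcofgs).
STANDARD_RANKS = ["d__", "p__", "c__", "o__", "f__", "g__", "s__"]


def backfill(taxonomy, ranks=STANDARD_RANKS):
    """Append empty rank prefixes below the last rank that is present."""
    labels = [label.strip() for label in taxonomy.split(";") if label.strip()]
    if not labels:
        # Nothing annotated at all: return every prefix, all empty.
        return "; ".join(ranks)
    # Recover the prefix of the deepest label, e.g. "o__" from "o__Gymnodiniphycidae".
    last_prefix = labels[-1].split("__", 1)[0] + "__"
    if last_prefix not in ranks:
        # The taxonomy ends at a rank we do not know how to continue from.
        raise ValueError(f"cannot backfill below unknown rank {last_prefix!r}")
    missing = ranks[ranks.index(last_prefix) + 1:]
    return "; ".join(labels + missing)


truncated = ("d__Eukaryota; k__Alveolata; p__Dinoflagellata; "
             "c__Dinophyceae; o__Gymnodiniphycidae")
print(backfill(truncated))
```

With only the 7 standard prefixes in the list, a taxonomy truncated at an extra rank (for example one ending at `k__Alveolata`) raises an error in this sketch, because there is no way to tell which prefixes should follow it.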
It appears you are using all of the available SILVA taxonomic ranks, in which case the prefix backfilling will not work. We should probably update the LCA functionality so that it can backfill using any number or combination of taxonomic ranks.
I'd suggest you stick with the uniq option for now (it keeps identical sequences that have unique taxonomies), and let the classifier work out the taxonomic assignment. The classifier will, in effect, perform an LCA when it is unable to disambiguate very similar or identical sequences with differing taxonomy.
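For reference, this is roughly what an LCA over taxonomy strings looks like. It is only an illustrative Python sketch: the function name `lca` and the input lineages are made up, and it is not the procedure the classifier or dereplicate actually runs. It keeps the leading ranks that all taxonomies agree on and stops at the first disagreement, so it does not depend on a fixed list of rank prefixes.

```python
# Hypothetical sketch of a lowest common ancestor (LCA) over taxonomy strings.
def lca(taxonomies):
    """Return the shared leading ranks of several semicolon-delimited taxonomies."""
    split = [[label.strip() for label in t.split(";")] for t in taxonomies]
    shared = []
    # zip stops at the shortest taxonomy, so truncated lineages are handled.
    for labels in zip(*split):
        if len(set(labels)) != 1:
            break  # first rank where the taxonomies disagree
        shared.append(labels[0])
    return "; ".join(shared)


# Made-up lineages for two identical sequences with differing taxonomy.
print(lca([
    "d__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales",
    "d__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales",
]))
```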