Oh, thanks for listing the file name. I see that this is a new release of UNITE. I have checked out this release and I see that it has 623 sequences with that annotation! I compared this to an older release I have lying around, and there are 0 sequences with that annotation.
So that explains a few things.
- we have not seen this particular issue in the past, probably because the new release contains this problematic sequence with a bad annotation, while prior release(s) do not.
- the decision to extract or not may be particular to individual release versions (and individual databases for that matter)! Ultimately, it may be best to try things both ways to see what effect it has, since peculiarities of individual datasets may also impact this decision.
- Unless you are happy with unhelpful results, I would recommend cutting out the dead wood (unannotated/unidentified sequences) from the UNITE database, and this is something I have done in the past.
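If it helps, here is a rough Python sketch of what I mean by cutting out the dead wood. It is only an illustration, not the exact procedure I used: it assumes a two-column, tab-separated taxonomy file (sequence ID, then a `k__...;p__...` style lineage) with no header, and it treats a sequence as uninformative when it has no real label at phylum level. The function names, the list of "uninformative" labels, and the choice of phylum as the cutoff are all my own assumptions, so adjust them to whatever your release actually looks like.

```python
# Labels treated as "no real annotation" (an assumption; extend as needed).
UNINFORMATIVE = {"", "unidentified", "unclassified", "unknown"}


def is_informative(lineage, rank="p"):
    """Return True if `lineage` carries a real label at `rank` ('p' = phylum)."""
    for field in lineage.split(";"):
        prefix, sep, label = field.strip().partition("__")
        if sep and prefix == rank:
            return label.strip().lower() not in UNINFORMATIVE
    # The rank is missing entirely, e.g. a lineage that stops at kingdom.
    return False


def read_taxonomy(lines):
    """Parse tab-separated 'id<TAB>lineage' lines into (id, lineage) pairs."""
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            rows.append((parts[0], parts[1]))
    return rows


def filter_taxonomy(rows, rank="p"):
    """Split (id, lineage) pairs into (kept, dropped) lists."""
    kept, dropped = [], []
    for seq_id, lineage in rows:
        target = kept if is_informative(lineage, rank) else dropped
        target.append((seq_id, lineage))
    return kept, dropped


if __name__ == "__main__":
    # Made-up records, only to show the behaviour.
    demo = [
        "seqA\tk__Fungi;p__Ascomycota;c__Sordariomycetes\n",
        "seqB\tk__Fungi;p__unidentified;c__unidentified\n",
        "seqC\tk__Fungi\n",
    ]
    kept, dropped = filter_taxonomy(read_taxonomy(demo))
    print(len(kept), "kept;", len(dropped), "dropped")
```

The IDs in `dropped` would also need to be removed from the matching sequence file so the taxonomy and sequences stay in sync. Keeping the dropped list around makes it easy to run things both ways and compare.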
Please let us know what you find!