Is there an easy answer why vsearch derep can't support multithreading?

I’ve posted the same question to the developers here, and at the moment all I know is that it’s not possible. I was curious whether anyone in the QIIME community understands why qiime vsearch dereplicate-sequences doesn’t support multithreading (beyond the fact that vsearch itself doesn’t implement it).

I might be able to provide some really basic insight.

A year or so ago I tested adding multithreading to our demux emp-* commands. What I found was that, for the most part, there is little to no computation required, so the task is I/O-bound. This means adding more threads actually made performance worse: the drive was still providing data as fast as it could, but now there was extra logic to synchronize the different threads.

Dereplicating needs a bit more logic than demultiplexing, but I wouldn’t be surprised if the vsearch devs ran into this same problem. Additionally, you need a data structure of some variety to identify when you’ve encountered a duplicate sequence, and creating multithreaded versions of any data structure is usually quite a bit more difficult.
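
To make the data structure point concrete, here is a minimal single-threaded sketch of dereplication in Python. This is only an illustration of the general idea and not how vsearch actually implements it; the function name `dereplicate` and the `(label, sequence)` input format are assumptions made up for this example.

```python
def dereplicate(records):
    """Collapse identical sequences, keeping the first label seen and a count.

    `records` is an iterable of (label, sequence) pairs. This input format is
    a hypothetical one chosen for the sketch, not a vsearch or QIIME 2 API.
    """
    seen = {}  # uppercase sequence -> [first_label, count]
    for label, seq in records:
        key = seq.upper()  # treat upper and lower case as the same sequence
        if key in seen:
            seen[key][1] += 1  # duplicate: only the count changes
        else:
            seen[key] = [label, 1]  # first time this sequence appears
    # Most abundant first; the sort is stable, so ties keep first-seen order.
    return sorted(
        ((label, seq, count) for seq, (label, count) in seen.items()),
        key=lambda rec: -rec[2],
    )


reads = [("a", "ACGT"), ("b", "TTGA"), ("c", "acgt"), ("d", "ACGT")]
for label, seq, count in dereplicate(reads):
    print(f">{label};size={count}\n{seq}")
```

Every read does one lookup and possibly one update on the same dictionary, so several threads would all have to coordinate around that one structure (with a lock, or by splitting it up somehow), on top of the reads already arriving only as fast as the disk can supply them.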


Thanks @ebolyen,
The devs followed up with a more detailed response too. Sounds like you were on to something with performance, in addition to a secondary issue they mention in their reply on their GitHub Issues page.
