vsearch cluster-features-de-novo is taking too long

Hi everyone,
I am trying to cluster ITS1 sequences into OTUs at two levels of similarity (0.98 and 1.00) by running vsearch cluster-features-de-novo on a server. The process is taking too long (more than 24 hours), and the command has been automatically stopped by the server for exceeding my time limit.
I did not have this problem when I was generating ASVs with DADA2.
It seems strange to me and I was wondering if there is any way to speed up the process.

My other questions are:
- Might cluster-features-closed-reference or cluster-features-open-reference be faster?
- Where can I download reference sequences (--i-reference-sequences) for ITS1?

Thank you all for your help!


Hi @Antani,

That is certainly odd. Have you dereplicated the sequences prior to de novo clustering?

You can download ITS reference data from UNITE and import it into :qiime2:
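For the import step, a minimal sketch might look like the following. It assumes you have already downloaded and unpacked a UNITE release; the helper name and both file names are placeholders I made up for illustration, so adjust them to match the files you actually have.

```shell
# Hypothetical helper: import a UNITE reference FASTA as a QIIME 2 artifact.
# The function name and both file names are placeholders, not part of QIIME 2 or UNITE.
import_unite_refs() {
  ref_fasta="$1"   # reference FASTA taken from your UNITE download
  out_qza="$2"     # artifact to pass later as --i-reference-sequences

  qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-path "$ref_fasta" \
    --output-path "$out_qza"
}

# Usage (commented out so that sourcing this snippet does not run anything):
# import_unite_refs unite_refs.fasta unite-ref-seqs.qza
```

The resulting artifact is what the closed-reference and open-reference clustering commands expect for `--i-reference-sequences`.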


Hi @SoilRotifer,
Yes, I dereplicated the sequences before clustering.

Thank you for the advice. I am now running cluster-features-closed-reference with the UNITE data, but the process is still taking a long time.

Do you have any other suggestions?
Thank you so much for your help!

Not at the moment. How many sequences are you trying to cluster?

From the Demultiplexed sequence counts summary I have a total number of 40,355,818 sequences.

Running qiime vsearch cluster-features-closed-reference with a percent identity of 1, the process finished correctly after ~48 hours. I now have the same command running with a different percent identity (0.98). It seems to be working properly, so it was just a matter of time.

Thank you so much for your support!

I forgot to ask: how many unique features (rep-seqs) are there in the dereplicated feature table?

You do not necessarily need to cluster at an identity of 1 if the sequences are dereplicated. In fact, that is the definition of dereplication: keeping only unique sequences (i.e. 100% identity). You may lose a few features when clustering with vsearch, due to how the clustering accept/reject algorithm works. But I just wanted to let you know. :sunny:
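As a toy illustration of what dereplication does (not how vsearch implements it), the plain-shell sketch below counts exact-duplicate sequences in a made-up three-read FASTA. The reads and variable names are invented for the example, and it assumes each sequence sits on a single line.

```shell
# Toy FASTA held in a variable: three reads, two of which are identical.
toy_fasta='>read1
ACGTACGT
>read2
ACGTACGT
>read3
ACGTTCGT'

# Number of reads = number of header lines.
n_reads=$(printf '%s\n' "$toy_fasta" | grep -c '^>')

# Number of unique sequences = distinct non-header lines.
n_unique=$(printf '%s\n' "$toy_fasta" | grep -v '^>' | sort -u | wc -l | tr -d ' ')

echo "reads: $n_reads, unique sequences: $n_unique"
# prints "reads: 3, unique sequences: 2"
```

Exact duplicates are already collapsed at this point, so a further clustering pass at an identity of 1 has very little left to merge.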

It still seems odd that it's taking so long. You are clustering the dereplicated feature table, correct? Not the demultiplexed data that contains 40 million reads? If you are clustering the demultiplexed reads, that is why it is taking so long. If not, then I'd suspect that there are many highly similar features and it is having trouble clustering them... :man_shrugging:
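To make the distinction concrete, here is a rough sketch of the order of operations I have in mind. The wrapper name, the artifact names, and the example thread count are placeholders, and the input is assumed to be a demultiplexed sequences artifact that has already been joined and quality filtered as needed. Please check `--help` for each command in your QIIME 2 version before running it.

```shell
# Hypothetical wrapper; the function name and all artifact names are placeholders.
derep_then_cluster() {
  seqs_qza="$1"       # demultiplexed (and joined/filtered) reads
  perc_identity="$2"  # e.g. 0.98
  threads="$3"        # e.g. 8; the default for --p-threads is 1

  # 1. Collapse the reads into unique sequences plus a feature table.
  qiime vsearch dereplicate-sequences \
    --i-sequences "$seqs_qza" \
    --o-dereplicated-table table-derep.qza \
    --o-dereplicated-sequences rep-seqs-derep.qza || return 1

  # 2. Summarize the table to see how many features will be clustered.
  qiime feature-table summarize \
    --i-table table-derep.qza \
    --o-visualization table-derep.qzv || return 1

  # 3. Cluster the dereplicated features, not the raw reads.
  qiime vsearch cluster-features-de-novo \
    --i-table table-derep.qza \
    --i-sequences rep-seqs-derep.qza \
    --p-perc-identity "$perc_identity" \
    --p-threads "$threads" \
    --o-clustered-table table-dn.qza \
    --o-clustered-sequences rep-seqs-dn.qza
}

# Usage (commented out so that sourcing this snippet does not run anything):
# derep_then_cluster demux-joined-filtered.qza 0.98 8
```

If the server allows more than one core, raising `--p-threads` is probably the simplest thing to try for speed, and the feature count from step 2 would answer my question above.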
