Hi everyone,
I am trying to cluster ITS1 sequences into OTUs at two levels of similarity (0.98 and 1.00) by running vsearch cluster-features-de-novo on a server. The process is taking too long (more than 24 hours), and the command has been automatically stopped by the server for exceeding my time limit.
I did not have this problem when I was generating ASVs with DADA2.
It seems strange to me and I was wondering if there is any way to speed up the process.
I also have two other questions:
- Would cluster-features-closed-reference or cluster-features-open-reference be faster?
- Where can I download the reference sequences (--i-reference-sequences) for ITS1?
According to the demultiplexed sequence counts summary, I have a total of 40,355,818 sequences.
Running qiime vsearch cluster-features-closed-reference with a percent identity of 1, the process finished correctly after ~48 hours. I now have the same command running with a different percent identity (0.98). It seems to be working properly, so it was just a matter of time.
I forgot to ask: how many unique features (rep-seqs) are there in the dereplicated feature table?
You do not necessarily need to cluster at an identity of 1 if the sequences are already dereplicated. In fact, that is the definition of dereplication: keeping only unique sequences (i.e. 100% identity). You may lose a few features when clustering with vsearch, because of how the clustering algorithm accepts and rejects candidate matches. But I just wanted to let you know.
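To make the dereplication point concrete, here is a minimal Python sketch of the idea. It is not vsearch's actual implementation; it assumes reads are compared as exact, full-length strings, and the function name `dereplicate` and the toy reads are made up for illustration.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into unique sequences with abundance counts.

    Illustrative only: assumes exact, full-length string matching.
    """
    counts = Counter(reads)
    # Sort by decreasing abundance; break ties alphabetically for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy reads, not real ITS1 data.
reads = ["ACGTAC", "ACGTAC", "ACGTAA", "ACGTAC", "TTGCAA", "ACGTAA"]
unique = dereplicate(reads)

for seq, abundance in unique:
    print(seq, abundance)

print(f"{len(reads)} reads -> {len(unique)} unique sequences")
```

Every sequence that survives this step already differs from every other one, so clustering the result at an identity of 1 has essentially nothing left to merge. It also shows why the number of unique features, not the raw read count, is what determines how much work the clustering step has to do.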
It still seems odd that it's taking so long. You are clustering the dereplicated feature table, correct? Not the demultiplexed data that contains 40 million reads? If you are clustering the demultiplexed reads, that is why it is taking so long. If not, then I'd suspect that there are many highly similar features and it is having trouble clustering them...