concerns vsearch clustering speed

Hi,

I'm clustering ITS2 reads (denoised using DADA2) using vsearch:

qiime vsearch cluster-features-de-novo \
--i-sequences rep-seqs-dada2.qza \
--i-table table-dada2.qza \
--p-perc-identity 0.97 \
--p-threads 1 \
--o-clustered-table clustered_table.qza \
--o-clustered-sequences clustered_seq.qza

In my experience, clustering is often quite a computationally intensive and long process.

However, this clustering step is extremely fast, even for ~200 samples, many of them with 100K+ reads each. See the DADA2 output stats of the data before clustering:
dada2_output_stats_pre-cluster.tsv (11.7 KB)

I timed the vsearch command above using time and it took only 6 seconds on a single thread of an AMD® Ryzen 9 5900X CPU.

This seems so unlikely for so many samples and reads. Or does the .qza format make it very efficient?

I guess not many people complain about a processing step going too fast, but with this speed I get the feeling that it is not actually clustering properly.

So my question: is this a normal time for de novo clustering with vsearch of a quite extensive data set?

Hi @Rob_DNA,

I suggest you read up on how usearch / vsearch works. Different algorithms default to either speed or accuracy, with parameters to adjust the trade-off. For example, the best case would be to perform an exhaustive search, but that may be untenable in some cases and lead to days or weeks of runtime.

Thus, tools like usearch / vsearch do not perform exhaustive searches by default, which is why they run very fast. Instead, they stop according to certain termination criteria, which are controlled by the maxaccepts and maxrejects options. Once either limit is reached, the search for that query stops and the search for the next query begins.

Adjusting these values can improve OTU counts and OTU table construction. One issue with usearch / vsearch clustering is that a sequence may be placed within an OTU seed simply because that seed is the first one found that meets the match criterion (i.e. 97%). However, there may be a better OTU seed match for your query downstream that the algorithm has not found yet. For example, another OTU seed might match your sequence at 99%, and the sequence should be placed within that OTU instead. If the search criteria are not set properly, the search terminates before it finds the 'best' match, and reads can be placed into incorrect OTUs.
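To make the stop criteria concrete, here is a minimal Python sketch of a search that stops on maxaccepts / maxrejects. It is a toy illustration, not vsearch's implementation: the `identity` and `search` functions are hypothetical, identity is just the fraction of matching positions between equal-length sequences, and candidates are visited in the order given, whereas the real tools first rank candidates with a fast similarity heuristic. Treating a limit of 0 as "no limit" is a convention of this sketch.

```python
def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions (equal-length sequences only)."""
    if len(a) != len(b):
        raise ValueError("toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def search(query, centroids, threshold, maxaccepts=1, maxrejects=32):
    """Return (index, identity) of the chosen centroid, or None if nothing was accepted.

    In this sketch a limit of 0 means "no limit", i.e. an exhaustive search.
    """
    accepts = []
    rejects = 0
    for i, centroid in enumerate(centroids):
        ident = identity(query, centroid)
        if ident >= threshold:
            accepts.append((i, ident))
        else:
            rejects += 1
        # Stop as soon as either termination criterion is satisfied.
        if maxaccepts and len(accepts) >= maxaccepts:
            break
        if maxrejects and rejects >= maxrejects:
            break
    if not accepts:
        return None
    # Among the accepted candidates, keep the one with the highest identity.
    return max(accepts, key=lambda hit: hit[1])


query = "ACGTACGTACGTACGTACGT"
centroids = [
    "ACGTACGTACGTACGTACAA",  # 2 mismatches out of 20
    "ACGTACGTACGTACGTACGA",  # 1 mismatch out of 20
]

# Default-like settings: the first acceptable centroid wins.
print(search(query, centroids, threshold=0.85))
# Exhaustive search: every centroid is checked and the closest one wins.
print(search(query, centroids, threshold=0.85, maxaccepts=0, maxrejects=0))
```

With maxaccepts=1 the query lands in the first centroid that clears the threshold, even though the second one is a closer match. The exhaustive call picks the second centroid, at the cost of comparing against every candidate.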

That being said, this may not be an issue given your data! I am just pointing out some things to consider.

Anyway, you can read more about this here and here. The vsearch manual is here.

Sadly, the maxaccepts and maxrejects options are not currently available via qiime vsearch cluster-features-de-novo. I believe the defaults for vsearch are maxaccepts=1 and maxrejects=32.

Remember that de novo clustering was used to make OTUs right after quality filtering. So it was done on all 10-20 million reads on the Illumina run!

Yes, clustering is slow when you run it on millions of raw reads or thousands of dereplicated reads.

It's fast when you run it on a few hundred DADA2 output sequences.
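If you want to check how many sequences the clustering step actually receives, a rough sketch like the one below counts the FASTA records inside the artifact. It assumes that a .qza file is a zip archive and that a sequence artifact stores its records in a member ending in data/dna-sequences.fasta; the function name `count_sequences_in_qza` is made up for this example.

```python
import zipfile


def count_sequences_in_qza(path: str) -> int:
    """Count FASTA records in a sequence artifact (assumed zip layout, see above)."""
    with zipfile.ZipFile(path) as archive:
        fasta_members = [
            name for name in archive.namelist()
            if name.endswith("data/dna-sequences.fasta")
        ]
        if not fasta_members:
            raise FileNotFoundError("no data/dna-sequences.fasta found in " + path)
        with archive.open(fasta_members[0]) as handle:
            # Every FASTA record starts with a '>' header line.
            return sum(1 for line in handle if line.startswith(b">"))


# Example call, using the file name from the command in the first post:
# print(count_sequences_in_qza("rep-seqs-dada2.qza"))
```

The number returned is the number of sequences vsearch has to compare, which is typically far smaller than the total read count in the feature table.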

It really is 'orders of magnitude faster than blast'

Thanks @SoilRotifer for the great insight!

Aha, yes, this makes absolute sense! Thanks.
