In my experience, clustering is often quite a computationally intensive and long process.
However, this clustering step is extremely fast, even for ~200 samples with 100K+ reads per sample. See the dada2 output stats of the data before clustering: dada2_output_stats_pre-cluster.tsv (11.7 KB)
I timed the vsearch command above using time and it took only 6 seconds on a single thread of an AMD® Ryzen 9 5900X CPU.
This seems so unlikely for so many samples and reads. Or does the .qza format make it very efficient?
I guess not many people complain about a processing step going too fast, but with this speed I get the feeling that it is not actually clustering properly.
So my question: is this a normal time for de novo clustering with vsearch of a quite extensive data set?
I suggest you read up on how usearch / vsearch works. Different algorithms default to either speed or accuracy, with parameters to adjust them. For example, the best case would be to perform an exhaustive search, but that may be untenable in some cases and lead to days or weeks of runtime.
Thus, tools like usearch / vsearch will, by default, not perform exhaustive searches, which is why they run very fast. They operate with certain termination or stop criteria, which are controlled by the maxaccepts and maxrejects options. Once either of these is satisfied, the search for the current query stops and the search for the next query begins.
Adjusting these values can help with better OTU counts and OTU table construction. One issue with usearch / vsearch clustering is that a sequence might be placed within an OTU just because it meets the match criterion (e.g. 97%) for that OTU seed. However, there may be a better OTU seed match for your query further along in the search (one that the algorithm has not reached yet). That is, another OTU seed might match your sequence at 99%, and the sequence should be placed within that OTU instead. If the search criteria are not set properly, reads can be placed into incorrect OTUs, because the search terminates before it finds the 'best' match.
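To make the stop criteria a bit more concrete, here is a toy Python sketch. It is not how vsearch is implemented: the identity function is a naive position-by-position comparison rather than a real alignment, the seeds are examined in plain list order (vsearch orders candidates by shared k-mers first), and the function names and sequences are made up for illustration. It only shows how stopping at the first accepted hit can pick a different OTU seed than a search that looks at every candidate.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions (no real alignment)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def search(query, seeds, threshold, maxaccepts=1, maxrejects=32):
    """Return the index of the best accepted seed, or None if none is accepted.

    The search stops once `maxaccepts` hits have been accepted or
    `maxrejects` candidates have been rejected. In this sketch a value
    of 0 means "no limit", i.e. look at every seed.
    """
    accepted = []  # (identity, seed index) for every accepted hit
    rejects = 0
    for idx, seed in enumerate(seeds):
        ident = identity(query, seed)
        if ident >= threshold:
            accepted.append((ident, idx))
            if maxaccepts and len(accepted) >= maxaccepts:
                break
        else:
            rejects += 1
            if maxrejects and rejects >= maxrejects:
                break
    if not accepted:
        return None
    # Among the accepted hits, keep the one with the highest identity.
    return max(accepted, key=lambda hit: hit[0])[1]


# Hypothetical seeds and query, 10 bases each, 75% identity threshold.
seeds = ["AAAAAAAAAA", "AAAAAAACCC"]
query = "AAAAAAAACC"  # 80% identical to seed 0, 90% identical to seed 1

first_hit = search(query, seeds, 0.75, maxaccepts=1, maxrejects=32)
exhaustive = search(query, seeds, 0.75, maxaccepts=0, maxrejects=0)
print(first_hit, exhaustive)
```

With maxaccepts=1 the query lands in seed 0 because that is the first seed that passes the threshold, even though seed 1 is the closer match. The exhaustive search finds seed 1.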
That being said, this may not be an issue given your data! I am just pointing out some things to consider.
Anyway, you can read more about this here and here. The vsearch manual is here.
Sadly, the maxaccepts and maxrejects options are not currently available via qiime vsearch cluster-features-de-novo. I believe the defaults for vsearch are maxaccepts=1 and maxrejects=32.
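If you want to experiment with these settings, one option is to run vsearch directly on exported sequences, outside of QIIME 2. Below is a rough Python sketch that only builds the command line; it does not run anything. The file names are hypothetical, the helper function is my own, and I am assuming the FASTA headers carry ;size=N abundance annotations (needed for --sizein). As far as I know, setting both maxaccepts and maxrejects to 0 makes vsearch search the complete database, but please check the vsearch manual for your version. Also, this is not necessarily the same set of options the QIIME 2 plugin passes to vsearch.

```python
import shlex


def build_vsearch_command(fasta_in, centroids_out, uc_out,
                          identity=0.97, maxaccepts=0, maxrejects=0,
                          threads=1):
    """Build (but do not run) a vsearch de novo clustering command.

    This helper is illustrative only. File names are placeholders, and
    the input FASTA is assumed to have ';size=N' abundance annotations.
    """
    return [
        "vsearch",
        "--cluster_size", fasta_in,       # cluster in order of decreasing abundance
        "--sizein",                       # read abundances from the headers
        "--id", str(identity),            # identity threshold, e.g. 0.97
        "--maxaccepts", str(maxaccepts),  # 0 together with maxrejects 0: full search
        "--maxrejects", str(maxrejects),
        "--centroids", centroids_out,     # FASTA file of OTU seed sequences
        "--uc", uc_out,                   # cluster membership table
        "--threads", str(threads),
    ]


cmd = build_vsearch_command("rep-seqs.fasta", "centroids.fasta", "clusters.uc")
print(shlex.join(cmd))
```

Expect the run to take noticeably longer than 6 seconds with these settings, since every query is compared against many more OTU seeds. Comparing the resulting OTU count with the default run would tell you whether the stop criteria matter for your data.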