Are OTU numbers reproducible in vsearch?

Hi everyone,

In qiime1, using the exact same data and parameters, usearch produces different OTU numbers across runs. I’m considering switching to qiime2 (which uses vsearch instead) and would like to know whether this behaviour persists there. What’s the explanation for this irreproducibility? I tried looking on forums and read the original usearch paper by Edgar but came up empty-handed.

An explanation would be appreciated!

Hi Steve,

Could you clarify what you mean by OTU number?

Are you referencing the identifier, the identifier/sequence combination, the actual number of counts recruited to each OTU, or the number of unique sequences?



Hi Justine,

Sorry, I might not have been clear! In qiime1 using the exact same data, I could get 10000 unique OTUs on one occasion, and then 10012 unique OTUs on another occasion, and 10007 OTUs on another occasion etc.

Roughly speaking, after filtering chimeras, I would run:

usearch -derep_fulllength
usearch -sortbysize
usearch -cluster_otus

I had collaborators ask me about this, and I was not able to explain it even after searching everywhere. I eventually told them that usearch -cluster_otus must have a stochastic component to it (I’m not sure that’s right, but it sounds plausible).

Now qiime2 uses vsearch, which is based on usearch. So I assume the same problem persists? Did qiime1 (or should I say Edgar) ever give an explanation for why there’d be different OTU clusters across instances? I know his code isn’t open source, but surely someone has brought up this irreproducibility issue before.


PS. To answer the question, it would be the actual number of counts recruited to each OTU. Say OTU_A and OTU_B have 1 and 2 counts respectively in one run. If I rerun the code, OTU_A and OTU_B could have 0 and 3 counts respectively, which would change the number of unique OTUs from 2 to 1. It would make sense that the problem arises at the clustering stage.
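
To make the arithmetic concrete, here is a tiny Python sketch of that scenario. The two count tables are the made-up numbers from the example above, and `summarize` is just a hypothetical helper, not part of qiime or usearch.

```python
# Hypothetical per-OTU read counts from two runs on identical input.
run_1 = {"OTU_A": 1, "OTU_B": 2}
run_2 = {"OTU_A": 0, "OTU_B": 3}

def summarize(counts):
    """Return (number of OTUs with at least one read, total reads)."""
    observed = [otu for otu, n in counts.items() if n > 0]
    return len(observed), sum(counts.values())

print(summarize(run_1))  # (2, 3)
print(summarize(run_2))  # (1, 3)
```

The total number of reads is the same in both runs; only their assignment to OTUs differs, and that alone changes the number of unique OTUs.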

Hi Steve,

My suspicion (not having seen the code) is that it has to do with the way the seed sequences for clustering are selected. I think, though, that usearch goes through and compares each sequence against the existing OTUs until it finds one that matches. It may be possible that there’s a better match later, which might explain the difference if the sequence ordering differs between runs. But again, it’s a closed algorithm, so I’m not sure.
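
To illustrate how input order alone could change the result, here is a minimal Python sketch of a generic greedy centroid clustering. This is not the actual usearch or vsearch algorithm. The `identity` function (a per-position match on equal-length strings), the 0.75 threshold, and the toy sequences are all assumptions made for illustration.

```python
def identity(a, b):
    """Fraction of matching positions; a toy stand-in for alignment identity."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, threshold=0.75):
    """Assign each sequence to the first centroid it matches,
    or make it a new centroid if it matches none."""
    centroids = []
    assignments = {}
    for s in seqs:
        for c in centroids:
            if identity(s, c) >= threshold:
                assignments[s] = c  # first hit wins, even if a later centroid is closer
                break
        else:
            centroids.append(s)
            assignments[s] = s
    return centroids, assignments

# Same three sequences, two different input orders.
order_a = ["AAAA", "AAAT", "AATT"]
order_b = ["AAAT", "AAAA", "AATT"]

centroids_a, _ = greedy_cluster(order_a)
centroids_b, _ = greedy_cluster(order_b)
print(len(centroids_a), centroids_a)  # 2 ['AAAA', 'AATT']
print(len(centroids_b), centroids_b)  # 1 ['AAAT']
```

With "AAAA" as the first seed, "AATT" is too far away and starts its own cluster. With "AAAT" as the first seed, both other sequences fall within the threshold, so there is only one cluster. Any greedy scheme of this kind gives stable results only if the input order is itself stable.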

As for vsearch, I checked their help page and didn’t see a stability issue come up. You may want to do a deeper dive there, though. Sorry if this isn’t as direct an answer as you might like, and is more of a new direction.

However, if you’re doing de novo clustering (and it sounds like you are, if your OTUs are unstable?), you might want to consider a subOTU method if you’re working with 16S. It gives you sequence-level resolution, which should be more reproducible and robust than clustering. This (of course) fails wherever other de novo methods fail, such as cross-hypervariable-region comparisons.

Hope this is semi helpful

Hi Justine,

Thanks for the response! I thought about the seeding sequence as well, but with the usearch -sortbysize command, the order should always be the same, so I’m not sure if that explains the issue.

So in qiime2, is the subOTU method the default implementation? Or is that something that one would have to configure manually? Having skimmed over the literature, it sounds like subOTU is just a fancy name for read alignment, or am I mistaken?

Hi Steve,

Ummm. I’m a bit stumped then, with the clustering. Sorry. I would be really curious whether your centroids remain constant, though.
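
If you want to check that, one rough way is to compare the centroid sequences from two runs directly. Below is a minimal Python sketch. It assumes each run's centroids are available as FASTA text, and the two small FASTA strings are made-up stand-ins for real output files. It compares sequences only and ignores the identifiers.

```python
def read_fasta_sequences(text):
    """Return the set of sequences in FASTA-formatted text, ignoring headers."""
    seqs = set()
    current = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current:
                seqs.add("".join(current).upper())
            current = []
        else:
            current.append(line)
    if current:
        seqs.add("".join(current).upper())
    return seqs

def compare_centroids(text_a, text_b):
    """Split centroid sequences into shared, run-A-only, and run-B-only sets."""
    a = read_fasta_sequences(text_a)
    b = read_fasta_sequences(text_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}

# Made-up centroid files from two runs on the same input.
run1 = ">OTU_1\nACGTACGT\n>OTU_2\nTTGGCCAA\n"
run2 = ">OTU_1\nACGTACGT\n>OTU_2\nTTGGCCAT\n"

result = compare_centroids(run1, run2)
print(len(result["shared"]), "shared centroid sequences")
print(sorted(result["only_a"]), sorted(result["only_b"]))
```

If the `only_a` and `only_b` sets are empty across your runs, the centroids are stable and only the read assignment is changing. If they are not empty, the seed selection itself differs between runs.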

The architecture of QIIME 2 is a bit different from QIIME 1. There aren’t workflow commands in the same way. Instead, you select a specific workflow, and each is its own command. So, your de novo vsearch OTU picking would be called with something like

qiime vsearch cluster-features-de-novo

And you’d denoise using deblur with

qiime deblur denoise-16S

Deblur, DADA2, and vsearch are all part of the vanilla QIIME 2 plugins in the latest release.

The subOTU methods are similar to alignment, with a component that essentially infers an error profile and either removes (Deblur) or corrects (DADA2) the errors.

Hey there @SteveMcL - I moved this topic into “Other Bioinformatics Tools” since this really seems to be about vsearch, and not QIIME 2 itself, please let me know if I have misunderstood. Thanks!
