A quick clarification on clustering

km4htc · September 14, 2017, 4:38pm

Hi all,

I'm trying to better understand the differences between how uclust (the QIIME 1 default), deblur, and dada2 pick OTUs. I see from the deblur GitHub page that sortmerna, vsearch, and mafft are all wrapped in deblur, but I don't see which is the default in the deblur QIIME 2 plugin. Is dada2 a standalone plugin, or does it wrap other tools as well? Do both deblur and dada2 use quality score information, and do both pick OTUs de novo?

In short, could someone provide some information about how these different clustering methods actually work? I haven't made much progress improving my understanding through various readme pages and original papers.

Thanks!!

ebolyen · September 14, 2017, 8:18pm

Hi @km4htc!

Both DADA2 and Deblur are denoising algorithms, and they produce what we would call Amplicon Sequence Variants (ASVs). You can think of them as 100% OTUs; however, instead of clustering to indirectly limit error, sequencing error is directly modelled and corrected for. Ultimately, this means you are operating directly on every (error-corrected) sequence in a sample.
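To make the contrast concrete, here is a toy Python sketch. It is not how uclust, DADA2, or Deblur are actually implemented: `greedy_cluster` is a simplified stand-in for 97% centroid clustering, and `toy_denoise` is a hypothetical abundance-ratio rule standing in for a real statistical error model. The sequences, counts, and the `min_fold` parameter are all made up for illustration.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(reads, threshold=0.97):
    """Toy centroid clustering: abundant sequences become centroids first,
    and anything within `threshold` identity of a centroid is absorbed."""
    centroids = {}
    for seq, n in Counter(reads).most_common():
        for c in centroids:
            if identity(seq, c) >= threshold:
                centroids[c] += n
                break
        else:
            centroids[seq] = n
    return centroids


def toy_denoise(reads, max_mismatches=1, min_fold=50):
    """Toy denoiser (an assumption, not a real error model): a sequence is
    treated as an error of a near-identical sequence only if that sequence
    is at least `min_fold` times more abundant."""
    variants = {}
    for seq, n in Counter(reads).most_common():
        parent = None
        for v, v_n in variants.items():
            mismatches = sum(x != y for x, y in zip(seq, v))
            if mismatches <= max_mismatches and v_n >= min_fold * n:
                parent = v
                break
        if parent is None:
            variants[seq] = n      # kept as its own sequence variant
        else:
            variants[parent] += n  # read count folded into the parent
    return variants


# Two "real" variants that differ at a single position (99% identical),
# plus a few rare reads that each carry one extra mismatch.
variant_a = "ACGT" * 25
variant_b = variant_a[:10] + "T" + variant_a[11:]
err_a = variant_a[:50] + "A" + variant_a[51:]
err_b = variant_b[:70] + "C" + variant_b[71:]
reads = [variant_a] * 500 + [variant_b] * 300 + [err_a] * 2 + [err_b] * 1

otus = greedy_cluster(reads)
asvs = toy_denoise(reads)
print(len(otus), "OTU(s) at 97%;", len(asvs), "sequence variant(s) after toy denoising")
```

In this made-up data set, the 97% clustering merges both real variants and the error reads into a single OTU, while the toy denoiser keeps the two abundant variants apart and folds the rare error reads into them.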

It might seem like this would result in a great many more features than you would have in QIIME 1. In practice, however, much of the variation you see is a result of sequencing error, which denoising accounts for. This means you generally end up with fewer sequences, of much higher quality, than you would have with, say, 97% OTUs, which is kind of a win-win!

This text is a great explanation outlining some of the advantages of ASVs over cluster-based OTUs if you'd like more details.

My understanding is that the tools Deblur wraps (sortmerna, vsearch, and mafft) are all used to identify sequencing error (and filter out unrelated sequences). In other words, it's not so much that you pick any one of them; they are used together.

DADA2 is a standalone R library and can be used on its own (it is largely C++ with R bindings). We work with @benjjneb (the author) to wrap that functionality in q2-dada2 (the QIIME 2 plugin), making it accessible to QIIME 2 interfaces.

Yes, the quality information is very important to this process.

In the future we will be supporting OTU picking methods such as closed-reference and de-novo clustering. However, even in these cases, you'll probably get better results using ASVs rather than the raw reads as input (we plan on supporting both use-cases). Ultimately, ASVs could be seen as complementary to this process, and de-novo clustering/closed-reference OTU picking becomes more of an optional path you can take in your analysis instead of something fundamental like it was in QIIME 1.

km4htc · September 18, 2017, 5:04pm

Thanks so much! That's a very helpful--and quick--orientation!!