Would anyone be able to help clear up my confusion about sequence clustering in qiime2, please? In mothur, sequence clustering is done in two steps. The first is a precluster step, where very similar sequences are merged using the pre.cluster command. This is to further denoise the sequences. It splits the sequences by group, sorts them by abundance, and then goes from most abundant to least abundant, identifying sequences that are within 2 nt of each other (if diffs is set to 2). If they are, they get merged. This can drastically reduce the number of unique sequences. The second is the real clustering step with the cluster.split command, where sequences are clustered based on the previously calculated distances (how different they are from each other), for example 3% different or 1% different.
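As an illustration only, the abundance-ordered merge described above could be sketched in Python roughly as follows. This is not mothur's code: the function names `hamming` and `precluster` and the toy `reads` are made up for the example, the sequences are assumed to be aligned to equal length, and only a single group (sample) is handled.

```python
def hamming(a, b):
    """Count mismatched positions between two equal-length (aligned) sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def precluster(counts, diffs=2):
    """Fold each rarer sequence into the first more-abundant sequence
    that is within `diffs` mismatches of it (hypothetical helper)."""
    # Most abundant first; ties broken alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    merged = {}
    absorbed = set()
    for i, seed in enumerate(ordered):
        if seed in absorbed:
            continue  # already merged into a more abundant sequence
        merged[seed] = counts[seed]
        for other in ordered[i + 1:]:
            if other not in absorbed and hamming(seed, other) <= diffs:
                merged[seed] += counts[other]
                absorbed.add(other)
    return merged


# Toy abundance table for one group: {sequence: read count}.
reads = {
    "ACGTACGTAC": 100,
    "ACGTACGTAA": 3,   # 1 mismatch from the most abundant sequence
    "ACGTACGGAA": 2,   # 2 mismatches from the most abundant sequence
    "TTGTACGTCC": 40,  # 3 mismatches, so it stays separate at diffs=2
    "TTGTACGTCA": 1,   # 1 mismatch from the 40-read sequence
}

result = precluster(reads, diffs=2)
for seq, n in result.items():
    print(seq, n)
```

With diffs=2 the five unique sequences collapse to two, which is the kind of reduction in unique sequences mentioned above.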
(1) Is clustering a must-have step for general 16S data analysis?
(2) In qiime2, is the clustering already done by the dada2 denoising step, or does dada2 only perform the aforementioned pre.cluster step, so that further clustering is still necessary?
(3) If further clustering is needed, is q2-vsearch the plugin that does a similar job? Are there any other options?
(4) Or is q2-vsearch equivalent to dada2 denoising in terms of sequence clustering?
(5) Dereplicating in qiime2 is actually clustering sequences, right?
Thank you in advance. Looking forward to any inputs.
Good questions. In addition to my answers below, I recommend that you check out the online tutorials at https://qiime2.org; these will give you a sense of the possible workflows with QIIME 2, and what a “normal” workflow would look like with denoising vs. OTU clustering methods.
Same in QIIME 2, but the first step is to dereplicate (with vsearch dereplicate-sequences) instead of pre-clustering sequences with ≤ 2 nt difference. See the OTU clustering tutorial on the QIIME 2 website for more details.
No! Definitely not. In fact, we generally encourage the use of denoising methods like dada2 or deblur in place of OTU clustering. While OTU clustering has its place and can be useful for certain experimental questions, denoising is generally a better approach for removing erroneous sequences (see the dada2 and deblur papers for benchmarking evidence). It is possible to cluster denoised ASVs into OTUs, but in most use cases this will only reduce your resolution and is not recommended.
See the dada2 article for more details. dada2 is doing much more than pre-clustering. It detects and then corrects or removes errors in the sequences, and then dereplicates the resulting unique sequence variants. Further clustering is NOT necessary (but is an option for users who still want OTUs).
They are NOT equivalent. q2-vsearch will just perform classic OTU clustering. dada2 is a proper denoising method. See the dada2 paper for more details.
No, dereplicating is dereplicating. It is collapsing all replicates into single representative sequences.
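To make the distinction concrete, here is a minimal Python sketch of what dereplication means conceptually. It is not how vsearch is implemented, and the `dereplicate` helper and the toy reads are invented for the example. Only exact duplicates are collapsed; a read that differs by even one base stays a separate sequence, which is why this is not clustering.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse exact duplicate reads into {unique sequence: count} (hypothetical helper)."""
    return dict(Counter(reads))


raw_reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGTAA",  # one base different: NOT merged by dereplication
    "TTGTACGTCC",
    "TTGTACGTCC",
]

unique = dereplicate(raw_reads)
for seq, n in unique.items():
    print(seq, n)
```

Six reads become three unique sequences with counts 3, 1 and 2. A clustering step with a similarity threshold would be needed to merge the near-identical first two.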
I hope that helps clarify!
Thank you very much for the clarification. The tutorial for qiime2 plugin workflow is also very helpful. Thanks again!
Have a great day.