I would just like to ask what the difference is between denoising and dereplication when used inside workflows. I have been using vsearch and I have joined, quality filtered and dereplicated (as described in the overview); however, I have NOT performed any denoising specifically, either using DADA2 or Deblur. I know that DADA2 is an independent plugin and therefore performs denoising, dereplication, filtering etc. internally, but if I have dereplicated using vsearch dereplicate-sequences, do I still need to run additional denoising using deblur? Basically, are the vsearch dereplicate-sequences and deblur denoise-16S commands equivalent, and if not, what is the difference, and should I therefore use both to achieve accurate dereplication and denoising?
In my view, denoising is a step where you try to infer the set of original amplicon sequences by analysing the error-containing sequences produced by the combination of PCR and sequencing. Dereplicating a sequence set merely collapses sequences that are 100% identical, so it still retains any sequences containing errors.
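To make the distinction concrete, here is a minimal sketch of what dereplication does, assuming reads are plain strings. The `dereplicate` function and the toy reads are hypothetical illustrations, not vsearch's implementation.

```python
from collections import Counter

def dereplicate(reads):
    """Collapse exact duplicates; return (sequence, count) pairs, most abundant first."""
    counts = Counter(reads)
    # Sort by descending count, then alphabetically so the order is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy data (made up): one read differs from the dominant sequence by a single base,
# as a sequencing error might produce.
reads = ["ACGTACGT"] * 5 + ["ACGTACGA"] + ["TTGGCCAA"] * 3

unique = dereplicate(reads)
for seq, n in unique:
    print(seq, n)
```

Note that the single-base variant is still in the output as its own unique sequence: dereplication only merges identical reads and makes no attempt to decide whether a sequence is real or an error.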
By using vsearch to join, quality filter and dereplicate your sequences, you obtained a non-redundant set of sequences shared across all samples, which you now need to use as a reference to map all the reads against, in order to obtain an abundance table ("to see how many times each sequence is present in each sample"). You can probably use vsearch again for this step! At this point you may notice that there are plenty of sequences with very low total abundance, which may be sequences containing errors, and you need to discard these before proceeding with the analysis.
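As a rough illustration of the abundance table described above, the sketch below counts exact matches per sample. The function name, sample IDs and reads are all hypothetical, and a real pipeline would map reads with a tool such as vsearch rather than by exact string matching.

```python
from collections import Counter, defaultdict

def build_feature_table(samples):
    """samples: dict of sample_id -> list of reads.
    Returns dict of sequence -> {sample_id: count}."""
    table = defaultdict(dict)
    for sample_id, reads in samples.items():
        for seq, n in Counter(reads).items():
            table[seq][sample_id] = n
    return dict(table)

# Toy input (made up for illustration).
samples = {
    "sampleA": ["ACGTACGT"] * 4 + ["ACGTACGA"],
    "sampleB": ["ACGTACGT"] * 2 + ["TTGGCCAA"] * 3,
}

table = build_feature_table(samples)

# Total abundance of each unique sequence across all samples.
totals = {seq: sum(per_sample.values()) for seq, per_sample in table.items()}
for seq, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(seq, total, table[seq])
```

Sequences with a very small total, like the single-read variant here, are the ones you would be looking at when deciding what to discard.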
Deblur and dada2 are two alternative tools designed to denoise the dataset as well as report how many times each original amplicon (a feature in the final table) is present in each sample. Hence they produce two main outputs: the final feature table and the set of identified amplicons (you may see them as error-correcting the sequences -> dereplicating the error-free sequences -> counting the abundance of the sequences in each sample).
So what does denoising with deblur achieve that I cannot achieve independently using Vsearch? Basically do I need to include additional denoising steps IF I have already quality filtered, dereplicated etc?
Both deblur and dada2 should return error-free sequences, which will be beneficial for the taxonomy assignment step. You also get a smaller data set to handle after denoising, because sequences with errors are not kept as separate features but reverted to their original state. You could compensate for this by filtering out the low-abundance OTUs, but by applying an abundance filter you always run the risk of discarding real low-abundance clusters as well.
There are a few interesting discussions comparing OTUs (clusters obtained by vsearch, as in the old qiime1) and ASVs/ESVs (amplicon sequence variants obtained by denoisers such as dada2 or deblur); one is:
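The trade-off can be sketched with a toy example. This is NOT how deblur or dada2 work internally (they rely on error models and quality information); `toy_correct` is a hypothetical stand-in that simply folds a rare sequence into an abundant one when they differ by at most one base, and all names, thresholds and counts are assumptions for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def filter_low_abundance(counts, min_count):
    """Plain abundance filter: drop everything below min_count."""
    return {s: n for s, n in counts.items() if n >= min_count}

def toy_correct(counts, min_count, max_mismatch=1):
    """Toy error correction (an assumption, not a real denoiser): reassign a rare
    sequence to the most abundant near-identical sequence, otherwise keep it."""
    abundant = {s: n for s, n in counts.items() if n >= min_count}
    corrected = dict(abundant)
    for seq, n in counts.items():
        if seq in abundant:
            continue
        parents = [p for p in abundant
                   if len(p) == len(seq) and hamming(p, seq) <= max_mismatch]
        if parents:
            best = max(parents, key=lambda p: abundant[p])
            corrected[best] += n   # reads are reverted to the parent, not discarded
        else:
            corrected[seq] = n     # no plausible parent: treat as a real rare sequence
    return corrected

# Made-up counts: one likely error (1 mismatch from the dominant sequence)
# and one rare sequence that is clearly distinct.
counts = {"ACGTACGT": 50, "ACGTACGA": 2, "TTGGCCAA": 2}

filtered = filter_low_abundance(counts, min_count=3)
corrected = toy_correct(counts, min_count=3)
print("filtered: ", filtered)
print("corrected:", corrected)
```

In this toy case the abundance filter loses both the error and the distinct rare sequence, while the correction step returns the error reads to their likely parent and keeps the rare sequence.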
Another good one is:
Which reminds me of the chimera filtering steps embedded in deblur and dada2, which seem to be missing from your pipeline, judging by the steps you mention.
One question, what are you going to do next with the dereplicated sequences?
Keep in mind that working with ASVs and working with OTUs are both accepted ways to get your results, so as long as you are doing things correctly they are both valid! It is mostly a matter of preference and of knowing what you are working with!
Hi @taf1g17 ,
To add to @llenzi 's already excellent and comprehensive answer.
Denoising with Deblur/DADA2 requires you to operate on the raw fastq files. If you have already used vsearch for clustering/dereplication, you can no longer denoise even if you wanted to. What I personally recommend is to use either Deblur or DADA2 first to denoise your raw reads, taking advantage of their superior quality control and error correction; then, if you are interested in OTU clustering, you can also cluster the output from Deblur/DADA2 afterwards using vsearch, as shown here. There's also a video I made on this topic here, which goes into a bit more detail.