I am running this code (below) on my paired-end sequence data (16S, V4).
The output (denoising stats) shows counts for filtering, denoising, merging, and chimera removal.
What are the parameters for merging and chimera checking? I do not see them in this command (below). How am I getting merging and chimera results?
Can you please explain what denoising does in the DADA2 command?
I think you would get a much more complete understanding by reading the DADA2 paper and then checking out the docs. Then hop back on here and we can answer any more specific questions you might have at that point.
It sounds like you are asking about how DADA2 goes about choosing which reads it keeps and which it discards during the denoising process. Is this correct?
If so, it may take a bit more time to put a concise answer together. In the meantime, I would look back over the DADA2 paper and also watch this Denoising video from a recent workshop, which gives a good overview of the process. If neither of these answers your question, I will get back to you with a more detailed answer.
Here is a brief overview of the steps that DADA2 uses to produce its results:
A pairwise sequence comparison is performed on sequences that are part of the same kmer cluster.
An error model is run that calculates how likely it is that slightly differing sequences are caused by error vs actual differences in the sequence.
A statistical test is then run to determine whether the number of occurrences of a particular sequence is higher than would be expected from sequencing errors alone.
A divisive partitioning algorithm is then run: all similar sequences start in one partition, and each sequence in the partition is compared to the "center" of that partition. If a sequence is too far from the center, a new partition is created for it; if it is similar enough, it stays where it is. This algorithm is described in more detail here.
Once the partitions are inferred, an error model parameterization step occurs: the mismatches between each sequence and the center of its partition are tallied in a table, and that table is used to estimate the parameters of the error model.
Finally, the algorithm alternates between sample inference and parameter estimation until a consistent result emerges.
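To make the statistical test in the steps above more concrete, here is a minimal Python sketch of the idea: given a partition center and a less abundant sequence, it asks how likely it is to see that many copies of the sequence if every copy were just a misread of the center. This is only an illustration under simplifying assumptions, not the DADA2 implementation. The function names and the flat per-base error probability p_err are hypothetical; DADA2 itself learns quality-aware error rates from the data.

```python
import math

def error_rate(center, seq, p_err=0.001):
    """Probability of reading `seq` when the true sequence is `center`.

    Assumption: one flat substitution probability per base (p_err),
    split evenly over the three wrong bases. Real error models are
    quality-aware; this is a deliberate simplification.
    """
    if len(center) != len(seq):
        raise ValueError("sketch assumes equal-length, pre-aligned sequences")
    rate = 1.0
    for a, b in zip(center, seq):
        rate *= (p_err / 3.0) if a != b else (1.0 - p_err)
    return rate

def poisson_tail(mean, k):
    """P(X >= k) for X ~ Poisson(mean), via 1 - P(X < k).

    Precision is limited for very small tails, which is fine here.
    """
    term = math.exp(-mean)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= mean / (i + 1)
    return max(0.0, 1.0 - cdf)

def abundance_p_value(center, center_count, seq, seq_count, p_err=0.001):
    """Chance of seeing at least `seq_count` copies of `seq` if all of
    them were errors from `center`, given that `seq` was seen at all."""
    mean = center_count * error_rate(center, seq, p_err)
    seen_at_least_once = -math.expm1(-mean)  # 1 - exp(-mean)
    if seen_at_least_once <= 0.0:
        return 0.0  # errors cannot produce this sequence at all
    return min(1.0, poisson_tail(mean, seq_count) / seen_at_least_once)

center = "ACGTACGTAC"
variant = "ACGTACGTAT"  # one mismatch from the center
print(abundance_p_value(center, 1000, variant, 1))   # singleton: consistent with error
print(abundance_p_value(center, 1000, variant, 50))  # far too abundant to be error
```

A small p-value means the sequence is too abundant to be explained as errors from the center, so it would be split off into its own partition; a large p-value means it can stay with the center.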
Chimeras are detected by aligning a less common sequence to a more common one and then checking whether the part that does not match can be explained by some other more common sequence, so that the two more common sequences joined together would produce the less common one.
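As a rough illustration of the chimera check, the Python sketch below tests whether a less common sequence can be rebuilt from the left part of one more common sequence and the right part of another. It assumes all sequences are already aligned and of equal length, and it only accepts exact matches; the function names are hypothetical, and the real method works on alignments rather than this simple character comparison.

```python
def shared_prefix_len(a, b):
    """Number of leading positions at which a and b agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def shared_suffix_len(a, b):
    """Number of trailing positions at which a and b agree."""
    return shared_prefix_len(a[::-1], b[::-1])

def is_bimera(query, parents):
    """True if `query` equals a left piece of one parent joined to a
    right piece of another (exact matches only, equal lengths assumed)."""
    candidates = [p for p in parents if p != query and len(p) == len(query)]
    if not candidates:
        return False
    best_left = max(shared_prefix_len(query, p) for p in candidates)
    best_right = max(shared_suffix_len(query, p) for p in candidates)
    # If the best left and right matches together cover the whole query,
    # some breakpoint exists. A single parent that differs from the query
    # can never cover it alone, so two distinct parents are implied.
    return best_left + best_right >= len(query)

abundant = ["AAAAAAAAAA", "CCCCCCCCCC"]
print(is_bimera("AAAAACCCCC", abundant))  # left half of one, right half of the other
print(is_bimera("AAAAAGGGGG", abundant))  # right half matches no abundant sequence
```

In practice the candidate parents would be restricted to sequences that are more abundant than the query, since a chimera forms from templates that were already common in the sample.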