I'm still pretty new to most of this. I'm using cutadapt to remove primers prior to running dada2 denoise-paired (samples are paired-end, demultiplexed from three separate Illumina runs). I'm running cutadapt on all sequences, but separating them by run for denoising. My decisions are based on the following assumptions (which I can't actually find explicitly stated, so please let me know if they are false):
-cutadapt allows for some error when finding and removing primers (the default error rate is 0.1 and allows for insertions, deletions, and mismatches). It also finds and removes partial primer matches at the 5' end, given a minimum overlap of 5 bases. These parameters allow for a more thorough removal of primers than a fixed-length trim (e.g., the trim-left parameters)
-when running cutadapt I also filter by length, allowing reads 10% shorter and 10% longer than the expected read length once primers are removed (I'm using the 2x251 kit). I'm not sure about this, as I've read some posts that caution against performing any QC other than primer removal prior to denoising. Is this a bad idea? (My full cutadapt call is sketched after this list.)
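For concreteness, here's roughly what my cutadapt call looks like. The file names are placeholders, and the primer sequences are the standard 515F/926R pair, so double-check them against your own protocol:

```
# -g/-G: 5' primers on R1/R2 (partial matches at the read start are removed,
#   subject to the -O 5 minimum overlap); -e 0.1 is the default error rate.
# -m/-M: keep reads within ~10% of the expected post-trim lengths
#   (251 - 19 = 232 nt on R1, 251 - 20 = 231 nt on R2).
# --discard-untrimmed: drop read pairs where no primer was found.
cutadapt \
  -g GTGYCAGCMGCCGCGGTAA \
  -G CCGYCAATTYMTTTRAGTTT \
  -e 0.1 -O 5 --discard-untrimmed \
  -m 209 -M 255 \
  -o trimmed.R1.fastq.gz -p trimmed.R2.fastq.gz \
  input.R1.fastq.gz input.R2.fastq.gz
```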
I'm also curious about allowing for some variability in the read length. I'm assuming that even a single-nucleotide length difference between reads will lead to separate feature calls (which is why, when denoising samples from different runs separately, the trim and trunc parameters must be the same if the resulting feature tables are to be merged). Does the dada2 error model generation take some read length variability into account? Should I just chop off the first 19 (515F) and 20 (926R) bases and not account for other variability here?
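For reference, the fixed-trim alternative I'm asking about would look something like this on my end: skipping cutadapt's variable trimming and chopping the primers off with trim-left, keeping identical parameters across runs so the tables can be merged afterwards (file names are placeholders, and the truncation lengths are values I'd pick from the quality plots):

```
# Denoise each run separately with identical parameters, then merge.
for RUN in run1 run2 run3; do
  qiime dada2 denoise-paired \
    --i-demultiplexed-seqs ${RUN}-demux.qza \
    --p-trim-left-f 19 --p-trim-left-r 20 \
    --p-trunc-len-f 240 --p-trunc-len-r 200 \
    --o-table ${RUN}-table.qza \
    --o-representative-sequences ${RUN}-rep-seqs.qza \
    --o-denoising-stats ${RUN}-stats.qza
done

# Merge the per-run feature tables into one.
qiime feature-table merge \
  --i-tables run1-table.qza \
  --i-tables run2-table.qza \
  --i-tables run3-table.qza \
  --o-merged-table merged-table.qza
```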
I also can't seem to find the details of what DADA2 is doing implicitly when called in qiime2. Does it still generate the error model (and if so, can this be visualized for assessment like when following the DADA2 pipeline in R)? Are chimera checking and dereplication also completed within dada2 denoise-paired?
Lastly, is there anything in the denoising step similar to maxEE in DADA2's filtering, i.e., a quality-filtering threshold applied based on the expected errors?
Thank you very much for any insight, and apologies for the long question.
Good to hear from you again. I think you are on the right track!
Correct. Those settings sound OK to me.
Not a bad idea, but maybe not needed. I think the idea is to let denoising happen with as much data as possible to see what it can recover, then trim/filter afterwards. But simply removing data before denoising shouldn't cause problems.
To the source code!
The denoise_paired() function calls the run_dada_paired.R script. You can find the chimera checking and the dereplication inside that script, along with the default settings passed to the DADA2 R functions.
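If you want to check for yourself, here's one way to track the script down in a conda-based install (the exact file name and location vary by QIIME 2 version):

```
# Locate the wrapper script that q2-dada2 shells out to:
find "$CONDA_PREFIX" -name 'run_dada_paired.R'

# Then skim it for the relevant DADA2 calls:
grep -n 'filterAndTrim\|learnErrors\|dada(\|removeBimeraDenovo' \
  "$(find "$CONDA_PREFIX" -name 'run_dada_paired.R' | head -n 1)"
```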
There are the --p-max-ee-f and --p-max-ee-r flags. Is that what you are asking about?
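You can pull their defaults and descriptions straight from the plugin's help text:

```
# Show the documentation for the expected-error filters
# (defaults are 2.0 in current releases):
qiime dada2 denoise-paired --help | grep -A 3 'max-ee'
```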
Thanks again for such thorough and useful replies @colinbrislawn! Those are indeed the flags I am looking for. Just to make sure I'm keeping things straight: the default error rate in cutadapt of 0.1 pertains to matching the specified primer sequence, allowing for 10% error, while the --p-max-ee-f and --p-max-ee-r flags specify quality-filtering thresholds applied based on the expected errors computed from Phred scores. The number entered is equal to the min number of erroneous base calls allowable. This filtering is completed prior to generating the error model. Thanks!
During denoising, sequences that differ solely by variation in length will be grouped together. Pure length variation is not part of the error model.
However, quality control that introduces artefactual length variation (e.g. by trimming wherever quality drops below some threshold) does tend to reduce the sensitivity of denoising, and thus is not recommended. Instead, truncating at a fixed length cutoff is typically the better approach.
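In flag terms, the recommendation is to rely on --p-trunc-len-f/--p-trunc-len-r chosen from your quality plots, rather than something like the following (shown only to illustrate the pattern to avoid; values are arbitrary):

```
# Per-read quality truncation: each read is cut at the first base below Q18,
# so reads enter denoising at many different lengths.
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-q 18 \
  --p-trunc-len-f 0 --p-trunc-len-r 0 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats stats.qza
```

(Setting --p-trunc-len-f/-r to 0 disables the fixed-length cutoff, leaving only the quality-based truncation.)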
Thanks! Why would QC that introduces artefactual length variation (i.e. using --p-trunc-q 18, the threshold suggested by Mohsen et al. 2019) reduce the sensitivity of denoising, whereas truncation at a fixed length cutoff based on the q-score plots doesn't, when pure length variation is not part of the error model? Wouldn't the allowance for some read length variability during primer removal with cutadapt be completely 'erased' (can't think of a better term) by aligning all reads at the 5' end and then bluntly truncating at the 3' end?
For cutadapt, the adapter alignment algorithm uses unit costs: mismatches, insertions, and deletions are each counted as one error. This is easier to understand and allows a single parameter (the maximum error rate) to describe how many errors are acceptable.
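To make that concrete with your 515F primer (19 nt) at the default rate:

```
# Max errors = floor(error_rate * length of the matching sequence).
# Full 19-nt match at -e 0.1: floor(19 * 0.1) = 1 error allowed.
echo $(( 19 * 10 / 100 ))   # prints 1
# A minimal 5-base partial match (-O 5): floor(5 * 0.1) = 0 errors allowed.
echo $(( 5 * 10 / 100 ))    # prints 0
```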
Correct. Or I would say 'max number of expected error-ful bases allowable'.
Correct. Looking at the R script, trim and filter is the first step.
The ASVs inferred to be real during denoising are "seeded" by groups of exactly identical reads. At this step, length variation matters. Once the seed of a new ASV is deemed real (i.e. improbable under the error model), it will recruit other reads that differ only by length variation, and they will all be collapsed together at the end. However, the smaller the number of reads in that initial seed, the less statistical evidence there is to infer that new ASV in the first place. The result is reduced sensitivity, especially to rare variants, which is why we recommend avoiding artefactual length variation.