Hi, everybody! I need to conduct a 16S rRNA metagenomic analysis. My problem is that DADA2 leads to a big loss of my data, even if I don't do any trimming at all. For example, in the first sample of 192,000 reads, only 21,000 remained after filtering; moreover, after the denoising step, only 17,000 remained, and this is the largest sample! I studied the algorithm and believe the problem is that the expected-errors criterion (--p-max-ee, 2 by default) is too strict. How can I calculate an appropriate expected-errors value for my data set? My data are paired-end.
This is probably because you aren't trimming enough, not in spite of the lack of trimming. Untrimmed low-quality read tails inflate each read's expected error count, so more reads fail the maxEE filter, and the noisy tails also hurt merging. There are tons of posts on this forum about selecting appropriate trimming parameters for paired-end reads. Take a look at those: you need to trim enough to remove the noisy segments while still leaving enough overlap for the reads to merge.
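To make the trade-off concrete, here is a rough sketch of the arithmetic behind the maxEE filter. A Phred score Q implies a per-base error probability of 10^(-Q/10), and a read's expected errors are the sum of those probabilities. The quality profile below is hypothetical (a clean start with a noisy tail), not your data, and the heuristic is only an illustration of why truncating the tail rescues reads; it is not DADA2's algorithm.

```python
def expected_errors(quals):
    """Expected number of errors in a read, given its Phred scores.

    Phred Q implies a per-base error probability of 10^(-Q/10); the
    expected error count is the sum over all bases. This is the
    quantity the maxEE filter thresholds.
    """
    return sum(10 ** (-q / 10) for q in quals)

def pick_trunc_len(mean_quals, max_ee=2.0):
    """Longest truncation length whose cumulative expected errors stay
    at or below max_ee. A rough heuristic for illustration only."""
    total, length = 0.0, 0
    for q in mean_quals:
        total += 10 ** (-q / 10)
        if total > max_ee:
            break
        length += 1
    return length

# Hypothetical per-position mean qualities: 200 bp at Q38, 30 bp at
# Q30, then a 70 bp tail at Q15 (purely illustrative numbers).
profile = [38] * 200 + [30] * 30 + [15] * 70
print(pick_trunc_len(profile, max_ee=2.0))  # → 291
```

Note how the Q15 tail dominates: the first 230 high-quality bases contribute only about 0.06 expected errors, while each Q15 base adds about 0.03. Truncating before the tail keeps reads comfortably under the default threshold of 2.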
Calculate an appropriate threshold by performing extensive benchmarking with samples of known composition, e.g., mock communities. That's what the DADA2 authors did in the original publication, I believe, to arrive at the current default of 2. If you feel this is too strict and want to re-optimize, you will need to repeat those steps. Or just pick a number that you feel is acceptable.
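If you do sweep the threshold yourself, the core of the benchmark is just measuring how many reads survive each candidate maxEE value (and, with a mock community, how the downstream error rate responds). A minimal sketch, using simulated per-read expected-error values rather than real sequencing data:

```python
import random

def retention(read_ees, max_ee):
    """Fraction of reads whose expected errors are at or below max_ee."""
    kept = sum(1 for ee in read_ees if ee <= max_ee)
    return kept / len(read_ees)

random.seed(0)
# Simulated expected-error values for 10,000 reads (a log-normal
# spread chosen purely for illustration, not real data).
ees = [random.lognormvariate(0.5, 1.0) for _ in range(10_000)]

for max_ee in (1, 2, 5, 10):
    print(f"maxEE={max_ee}: {retention(ees, max_ee):.1%} of reads retained")
```

The point of the mock community is that retention alone isn't enough: a looser threshold keeps more reads but lets more erroneous ones through, and only a sample of known composition lets you see where that trade-off stops being worth it.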