Running DADA2 on pre-processed data

Not observing a drop in quality does not mean the sequences are free of errors. It just means there are fewer of them. Let's assume, in a magical world, that every base call in every sequence is Q40. That means you can expect about 1 error per 10,000 base calls across all of your data. In other words, you can have a high quality score, i.e. high confidence, in an incorrect base call. Thus, the longer the region of overlap between your paired reads, the greater the chance of a mismatch. Once you exceed the mismatch threshold, the merge will fail. In this example that would be rare, and you'd lose some sequences, but likely not a significant number.
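To put rough numbers on that, here is a minimal Python sketch of the standard Phred relationship (error probability = 10^(-Q/10)). The function names are my own, and it assumes every base call is independent and has the same quality, which real data will not satisfy, so treat it as an illustration only.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score into the probability that a base call is wrong."""
    return 10 ** (-q / 10)


def prob_at_least_one_error(q: float, n_bases: int) -> float:
    """Chance that at least one of n_bases calls is wrong.

    Assumption: calls are independent and all share the same quality q.
    """
    p = phred_to_error_prob(q)
    return 1 - (1 - p) ** n_bases


if __name__ == "__main__":
    # Longer stretches of sequence give more opportunities for an error,
    # even when every call is Q40.
    for n_bases in (12, 20, 36, 100):
        print(n_bases, f"{prob_at_least_one_error(40, n_bases):.4%}")
```

The probability stays small at Q40, but it grows steadily with the number of bases compared.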

I often like to aim for ~20-30 bases of overlap. By default, DADA2 expects ~12 bases of overlap. To find a good ball-park range for your truncation parameters, you can estimate how much overlap there is in your reads. I am not sure which primer set you are using for your V3V4 data, but I'll assume (as does this post) that you are using the 341F-805R primers. Thus:

  • 805-341 = ~464 bp amplicon length
  • Estimated overlap using 2x250 reads would be: (2 x 250) - 464 = 36 bp.
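The same arithmetic as a small Python sketch, in case you want to plug in your own numbers. The function name is hypothetical, and the primer positions (341 and 805) are just the ones assumed above.

```python
def estimated_overlap(fwd_len: int, rev_len: int, amplicon_len: int) -> int:
    """Rough overlap between paired reads: total read length minus amplicon length."""
    return fwd_len + rev_len - amplicon_len


# Amplicon length estimated from the assumed primer positions.
amplicon_len = 805 - 341  # 464 bp

# 2x250 sequencing run.
overlap = estimated_overlap(250, 250, amplicon_len)
print(f"amplicon ~{amplicon_len} bp, estimated overlap ~{overlap} bp")
```

A negative result would mean the reads do not reach each other at all, so they could not be merged.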

If you search the forum for "calculate amplicon length overlap" or similar, you'll come across several forum threads.

Given the above, you should be good to go. You could simply adjust your truncation parameters so that your overlap goes down to ~20 bp, but that is often not necessary. That said, I've had a few instances where reducing the overlap to this level, even with "good" data, helped substantially. Your mileage may vary.
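If you do want to work out truncation lengths for a target overlap, something like the following Python sketch would do it. The function name is hypothetical, and taking all of the trimming off the reverse read is purely my assumption for illustration (reverse reads often lose quality sooner); you may prefer to split it between both reads based on your quality plots.

```python
def truncation_for_target_overlap(
    fwd_len: int, rev_len: int, amplicon_len: int, target_overlap: int
) -> tuple[int, int]:
    """Return (forward, reverse) truncation lengths that leave ~target_overlap bases.

    Assumption: the forward read is kept at full length and all trimming
    is taken from the 3' end of the reverse read.
    """
    current_overlap = fwd_len + rev_len - amplicon_len
    excess = current_overlap - target_overlap
    if excess < 0:
        raise ValueError("reads are too short to reach the target overlap")
    if excess > rev_len:
        raise ValueError("cannot trim more bases than the reverse read contains")
    return fwd_len, rev_len - excess


# 2x250 reads, ~464 bp amplicon, aiming for ~20 bp of overlap.
trunc_f, trunc_r = truncation_for_target_overlap(250, 250, 464, 20)
print(trunc_f, trunc_r)
```

With these inputs the 36 bp overlap is reduced by 16 bp, all from the reverse read.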

Finally, do keep in mind that the above calculation can be affected after primer removal. For more details on that see this post.
