DADA2 error model with small dataset


QUESTION: What % of my reads should I be using for the DADA2 error model?

I've searched through the forums and found two posts that redirect to the DADA2 R tutorial for information about error rates. That's helpful if your data are ready to import into R, but all my data live on a server. Before I go down the rabbit hole of figuring out how to get them into R and remove the primers without QIIME 2, I thought I'd see if anyone here had advice on how to approach my question.

DADA2 uses a default of 1,000,000 reads when training its error model. I am working with two datasets from the same samples, one V4 and one V9. After running Cutadapt, I have 788,563 reads in the V4 data and 875,454 reads in the V9 data, across 12 samples. Clearly the default number of reads DADA2 will use is greater than the total in either of my datasets, and it seems this is not ideal. :slight_smile:

I've looked for papers that discuss this, but there doesn't seem to be much out there unless I've missed something.

Any help/suggestions are appreciated!



Yes, indeed DADA2's error-model training uses 1,000,000 reads by default. However, I don't think your results will be significantly affected by running DADA2 with fewer reads. The datasets you are working with each contain more than 700,000 reads, and IMHO you should be fine.
You could also denoise your reads with Deblur, which doesn't require 1,000,000 reads.
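To make the point above concrete: in QIIME 2's q2-dada2 plugin the training cap is exposed as `--p-n-reads-learn` (default 1,000,000), and when a dataset contains fewer reads than the cap, DADA2 simply trains on all of them. This is a minimal sketch of that arithmetic using the read counts from the question; the function name is illustrative, not part of any API.

```python
# Illustrative sketch (not DADA2 code): the error model trains on at most
# n_reads_learn reads; smaller datasets contribute every read they have.
DEFAULT_N_READS_LEARN = 1_000_000  # q2-dada2's --p-n-reads-learn default

def reads_used_for_training(total_reads: int,
                            n_reads_learn: int = DEFAULT_N_READS_LEARN) -> int:
    """Number of reads that would be drawn for error-rate learning."""
    return min(total_reads, n_reads_learn)

# Read counts reported in the question, after primer removal with Cutadapt.
for region, total in [("V4", 788_563), ("V9", 875_454)]:
    used = reads_used_for_training(total)
    print(f"{region}: {used:,} of {total:,} reads ({used / total:.0%}) used")
```

So for both the V4 and V9 datasets, 100% of the reads go into training, which is the most information the error model can get from these samples.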