Minimum reads required for DADA2 error-model training

sbentley · April 23, 2019, 3:07am

Hello everyone,

My apologies if this has already been asked - happy to be redirected.

I will be processing data from MiSeq runs where each sample may be focused on a different target. For instance, samples 1-3 are targeting V1-V3, while samples 4-6 are targeting V4, sample 7 is ITS, etc.

From what I understand for DADA2 processing, I should split and group these samples based on the targeted region so I can control the trimming requirements for each, allowing for optimal overlap.

I am curious whether anyone has benchmarked the minimum number of reads required to train the error model. For instance, sample 7 may be the only ITS sample, in which case I may need to increase its molarity when pooling the samples to try to obtain X reads.

Any help in this would be greatly appreciated,
Kind Regards
Steve

Mehrbod_Estaki · April 24, 2019, 7:51am

Hi @sbentley,

https://benjjneb.github.io/dada2/tutorial.html#learn-the-error-rates

This has almost certainly been done, though you would have to look to the DADA2 site/repo for the exact details. Somewhere on the forum (sorry, I couldn't find the exact thread) the DADA2 developer mentions that the default of 1M reads is based on some prior benchmarking and that there are only modest improvements beyond that, say up to 2M.
In your situation, it certainly would be important to check to see if the error model built on that one/handful of samples is a good fit. The native version of DADA2 in R has a nice visualization that lets you do this, as described in their documentation.
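To get a feel for how far a single-sample group falls short of that 1M figure before deciding how to pool, a rough sketch like the one below counts reads and bases in a FASTQ file. This is plain Python and not part of DADA2 or QIIME 2. The function names and the `TARGET_READS` constant are made up for illustration, the 1M value is simply the figure mentioned above (a guide, not a hard minimum), and the code assumes standard 4-line FASTQ records.

```python
import gzip
import io

# Assumption: the 1M-read figure discussed in this thread, used only as a guide.
TARGET_READS = 1_000_000


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_fastq(handle):
    """Return (n_reads, n_bases), assuming 4 lines per FASTQ record."""
    n_reads = 0
    n_bases = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of each record holds the sequence
            n_reads += 1
            n_bases += len(line.strip())
    return n_reads, n_bases


def shortfall(n_reads, target=TARGET_READS):
    """Number of additional reads needed to reach the target (0 if already met)."""
    return max(0, target - n_reads)


if __name__ == "__main__":
    # Tiny in-memory example; for a real file use open_fastq("sample7_R1.fastq.gz"),
    # where the filename is a hypothetical placeholder.
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
    reads, bases = count_fastq(demo)
    print(f"reads={reads} bases={bases} shortfall={shortfall(reads)}")
```

A read count well below the target does not by itself mean the error model is bad; it just flags the groups where checking the fit is most worthwhile.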
Hope that points you in the right direction!

sbentley · May 14, 2019, 12:32pm

Great - thanks for bringing this to my attention.

Cheers
Steve