DADA2 denoise-single chimera removal inconsistence

benjjneb · August 1, 2019, 7:04pm

OK I think I finally understand what is going on here.

As @ebolyen mentioned, the one stochastic element of the dada2 pipeline is that it uses a subset of the total samples to learn the error model. This is what introduces a dependence on sample order, as the first X samples are picked for error model learning. This subsetting is just for computational speed reasons, and if you set the n-reads-learn parameter high enough to use the whole dataset then the sample order dependence should go away.

So, the different manifests led to (slightly) different error models, that lead to (slightly) different denoising of the data. This didn't show up in the denoised total read stats, because until recently DADA2 would correct every read in the dataset, hence denoising does not change the total # of reads. However it did show up in the non-chimeric column, because one error model did a better job of distinguishing rare chimeric variants from the more abundant ASVs that they were generated from. Because those rare chimeras were identified as their own ASVs, they were then removed by the chimera filtering, while in the other dataset they were corrected to the non-chimeric ASVs that they were most similar too.

This happens because chimeras are often rare, and very close to a more abundant true ASV, which is the hardest situation to accurately discriminate. Thus modest changes in the error model can cause some of these chimeras to be identified or not, and you get the slight variation in total overall reads making it through the chimera removal stage.

Hope that makes sense! Overall I wouldn't worry about this though, this is really fiddling at the difficult to denoise "edges" of the data, and is unlikely to have much if any impact on total community signals.