DADA2 denoise-single chimera removal inconsistence

Hi! Thanks for quick reply! Here is the manifest and dada2 denoise summary files (I only upload part of these files not all)
first-manifest.csv (534 Bytes) second-manifest.csv (471 Bytes)
first-dada2-summary.tsv (411 Bytes) second-dada2-summary..tsv (352 Bytes)

Hi @Cotton, I can’t do anything with the partial manifests — can you send the whole ones? Also, can you send the QZAs for the DADA2 summaries, instead of the TSVs? This will allow us to interrogate the provenance, too.

Hi @thermokarst ! Here are the manifest.csv, demux.qzv, stats-dada2.qza files. Thank you!

first-manifest.csv (4.3 KB) first-demux.qzv (287.1 KB) first-stats-dada2.qza (10.3 KB)

second-manifest.csv (4.5 KB) second-demux.qzv (287.3 KB) second-stats-dada2.qza (10.3 KB)

Thanks so much @Cotton! To be honest, I have no clue what is going on here — time to call on @benjjneb! As best I can tell, no samples were missed or duplicated when you relabeled the Sample IDs. It looks like everything is identical between the two runs with the exception of the actual Sample IDs used. As far as I can tell, this shouldn’t have any bearing on the consensus protocol, but, maybe I am confused. @benjjneb, any clue why the exact same dataset, but with new IDs would produce different results? Thanks!

1 Like

Thank you @thermokarst :heart_eyes: . I have searched google hoping to find someone else who has the same problem like mine but … no one has.
I thought maybe something went wrong when importing data to QIIME2 . So I uncompressed these two demux.qza files to check the gunziped original files’ MD5. But sadly they are identical… So this problem is still a mystery :see_no_evil:

Huh. I don’t know! That is mysterious and shouldn’t be different.

I will look into it more, but am on vacation at the moment so it may be a few days.


Thank you and Enjoy your vocation @benjjneb :beach_umbrella: ! This is just an exercise so I’m not in a hurry. I’ll keep on finding differences between these two runs. Once I find the reasons that caused this inconsistence I’ll update this post.

1 Like

The possible reason is that DADA2 selects diiferent reads to train the error model (1000000 reads in default). You can add --verbose or -verbose in DADA2 command to find which samples are selected


Hi @nmgduan ! Thanks for your reply.
DADA2 does select partial reads to train the error model. I am confused about if the selection is random then every time you run DADA2 with the same input the output will be different. However I have re-ran DADA2 and the result is same as the former one which used the same demux.qza file as input (while two demux.qza files results in different outputs).
first-table-run1.qzv (742.7 KB) first-table-run2.qzv (742.7 KB)

And I quickly read DADA2 algorithm (DADA2: High-resolution sample inference from Illumina amplicon data | Nature Methods) it seems that the error model is used in dada() step which will affect the "denoised" column in dada2-summary.tsv not the "non-chimeric" column.

I'll add "--verbose" argument in DADA2 command to check this. Thanks for your advice!:heartpulse:

Maybe I should look at the R code of DADA2 to figure out what is going on in the "non-chimeric" step.

@thermokarst and I were talking about this briefly the other day. We’re stumped as well, but it may be the case that the error model is still too unstable at 1 million reads. This is a situation where the diagnostic plots that DADA2 produces could be really useful.

So a different file-ordering (from new names) results in a different sampling of reads for the error modeling, which results in different ASVs (but since trimming is the same, we will get the same number of reads at the denoising step), and these different ASVs are considered differently by the chimera checking.

If you ran DADA2 in R, there would be a point to check the error model, it’s possible you just need more reads to get it to converge correctly. (You could try 10 mil reads for example and see if that makes the two artifacts more concordant).

This was just our best guess though, we’re not really sure either. Our experience has been that DADA2 is exceedingly stable in it’s ASV choices so this is kind of weird all around.

Thank you so much @ebolyen @thermokarst ! I think your guess is quite reasonable. I’ll run DADA2 in R to try 10 mil reads and maybe more to train the error model. However, I have something else to do right now so maybe I’ll update this post a week later or longer. Sorry for the delay :sob:

1 Like

Quick update, there appears to be a sample-order dependence with the consensus chimera algorithm that we’ve never caught before. I haven’t tracked it down yet, but I think this is probably a bug of some kind so thanks for the report.

In my testing so far it appears to affect ASVs present in very few samples, so not likely to have much effect on final output, but we will be investigating further. Since this is an issue with the dada2 code, I opened a corresponding issue in the dada2 GitHub site which is where progress will show up first:


Well maybe a false alarm, the bug was actually in my testing code:

Now that I’ve fixed that, I’m so far unable to reproduce any sample-order dependence. @Cotton would you be able to share the sequencing data that you are processing with the Q2 plugin with me, so I can try to reproduce the behavior you are seeing?

OK I think I finally understand what is going on here.

As @ebolyen mentioned, the one stochastic element of the dada2 pipeline is that it uses a subset of the total samples to learn the error model. This is what introduces a dependence on sample order, as the first X samples are picked for error model learning. This subsetting is just for computational speed reasons, and if you set the n-reads-learn parameter high enough to use the whole dataset then the sample order dependence should go away.

So, the different manifests led to (slightly) different error models, that lead to (slightly) different denoising of the data. This didn’t show up in the denoised total read stats, because until recently DADA2 would correct every read in the dataset, hence denoising does not change the total # of reads. However it did show up in the non-chimeric column, because one error model did a better job of distinguishing rare chimeric variants from the more abundant ASVs that they were generated from. Because those rare chimeras were identified as their own ASVs, they were then removed by the chimera filtering, while in the other dataset they were corrected to the non-chimeric ASVs that they were most similar too.

This happens because chimeras are often rare, and very close to a more abundant true ASV, which is the hardest situation to accurately discriminate. Thus modest changes in the error model can cause some of these chimeras to be identified or not, and you get the slight variation in total overall reads making it through the chimera removal stage.

Hope that makes sense! Overall I wouldn’t worry about this though, this is really fiddling at the difficult to denoise “edges” of the data, and is unlikely to have much if any impact on total community signals.


Thank you so much @benjjneb ! Now I understand why there is a difference when change the order of my sampleIDs. And I agree that this difference is not a big problem.

I have sent you an e-mail attached with the imported sequencing data on Monday. But I am not sure whether I can send e-mails with large files (>500M) to Gmail in China. Maybe I have been blocked… Sorry for the delay and it seems you do not need the data anymore.

And I have to say sorry to @ebolyen . I was supposed to install the latest R and dada2 quickly and finish that analysis. However, I have been struggling with updating R and installing dada2 via BiocManager for four days. I feel so frustrated. But I promise once I
solve this I will change the order of sampleID and increase the error model reads number to affirm Ben Callahan’s conclusion.

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.