DADA2 denoise-single chimera removal inconsistency

Hi all!
I’m running QIIME 2 2019.1, installed via Anaconda on a CentOS 7 server, and I’ve run into a very weird problem.

I have some single-end, demultiplexed 16S amplicon sequencing data. I imported the data twice, using two different manifest.csv files (the two manifests differ only in the “sample-id” column; the “absolute-filepath” and “direction” columns are identical). I then ran the same dada2 denoise command on each import (shown below).

qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 0 \
  --p-n-threads 10 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-denoising-stats stats-dada2.qza

However, the number of features and the total frequency in the two table.qzv files are different. So I checked the stats-dada2.qzv files and found that the “input”, “filtered”, and “denoised” columns are identical, while the “non-chimeric” columns differ.

| non-chimeric (first run) | non-chimeric (second run) |
| --- | --- |
| 181663 | 180676 |
| 175891 | 174957 |
| 175152 | 174243 |
| 155349 | 154919 |
| 158265 | 157775 |
| 165684 | 165071 |
| 158502 | 158241 |
| 132739 | 132301 |
| 141902 | 141458 |
| 148559 | 147940 |
| 139161 | 138206 |
| 127764 | 126603 |
| 141785 | 140971 |
| 137757 | 137437 |

I re-ran the same command shown above on these two demux.qza files, and the results are still different. What could cause this inconsistency? :joy:

Thanks in advance!

Can you please share the two manifest files?

Hi! Thanks for the quick reply! Here are the manifest and dada2 denoise summary files (I’ve only uploaded part of each file, not the whole thing).
first-manifest.csv (534 Bytes) second-manifest.csv (471 Bytes)
first-dada2-summary.tsv (411 Bytes) second-dada2-summary…tsv (352 Bytes)

Hi @Cotton, I can’t do anything with the partial manifests — can you send the whole ones? Also, can you send the QZAs for the DADA2 summaries, instead of the TSVs? This will allow us to interrogate the provenance, too.

Hi @thermokarst ! Here are the manifest.csv, demux.qzv, stats-dada2.qza files. Thank you!

first-manifest.csv (4.3 KB) first-demux.qzv (287.1 KB) first-stats-dada2.qza (10.3 KB)

second-manifest.csv (4.5 KB) second-demux.qzv (287.3 KB) second-stats-dada2.qza (10.3 KB)

Thanks so much @Cotton! To be honest, I have no clue what is going on here — time to call on @benjjneb! As best I can tell, no samples were missed or duplicated when you relabeled the Sample IDs. It looks like everything is identical between the two runs with the exception of the actual Sample IDs used. As far as I can tell, this shouldn’t have any bearing on the consensus protocol, but maybe I am confused. @benjjneb, any clue why the exact same dataset, but with new IDs, would produce different results? Thanks!

Thank you @thermokarst :heart_eyes: . I searched Google hoping to find someone else with the same problem as mine, but… no one has.
I thought maybe something went wrong when importing the data into QIIME 2, so I uncompressed the two demux.qza files and compared the MD5 sums of the gunzipped original files. Sadly, they are identical… so this problem is still a mystery :see_no_evil:
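For anyone who wants to repeat that check, here is a minimal Python sketch of the idea (the file paths are hypothetical, not the actual layout inside the QZAs):

```python
# Sketch: decompress each fastq.gz and compare MD5 digests of the
# underlying data, to confirm the two imports contain identical reads.
import gzip
import hashlib

def md5_of_gzipped(path):
    """MD5 hex digest of the decompressed contents of a .gz file."""
    with gzip.open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()

# Hypothetical usage:
# md5_of_gzipped("first/data/sample1.fastq.gz") == \
#     md5_of_gzipped("second/data/sample1.fastq.gz")
```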

Huh. I don’t know! That is mysterious and shouldn’t be different.

I will look into it more, but am on vacation at the moment so it may be a few days.

Thank you, and enjoy your vacation @benjjneb :beach_umbrella: ! This is just an exercise, so I’m not in a hurry. I’ll keep looking for differences between the two runs, and once I find the cause of this inconsistency I’ll update this post.

1 Like

A possible reason is that DADA2 selects different reads to train the error model (1,000,000 reads by default). You can add --verbose to the DADA2 command to see which samples are selected.

Hi @nmgduan ! Thanks for your reply.
DADA2 does select a subset of reads to train the error model. What confuses me is this: if the selection were random, then every run of DADA2 on the same input would give a different output. However, I re-ran DADA2 and the result was identical to the earlier run that used the same demux.qza as input (while the two demux.qza files give different outputs).
first-table-run1.qzv (742.7 KB) first-table-run2.qzv (742.7 KB)

I also skimmed the DADA2 paper (https://www.nature.com/articles/nmeth.3869); it seems the error model is used in the dada() step, which should affect the “denoised” column of dada2-summary.tsv, not the “non-chimeric” column.

I’ll add the --verbose flag to the DADA2 command to check this. Thanks for your advice! :heartpulse:

Maybe I should look at DADA2’s R code to figure out what is going on in the “non-chimeric” step.

@thermokarst and I were talking about this briefly the other day. We’re stumped as well, but it may be the case that the error model is still too unstable at 1 million reads. This is a situation where the diagnostic plots that DADA2 produces could be really useful.

So a different file-ordering (from new names) results in a different sampling of reads for the error modeling, which results in different ASVs (but since trimming is the same, we will get the same number of reads at the denoising step), and these different ASVs are considered differently by the chimera checking.

If you ran DADA2 in R, there would be a point at which you could inspect the error model; it’s possible you just need more reads to get it to converge correctly. (You could try 10 million reads, for example, and see whether that makes the two artifacts more concordant.)

This was just our best guess though; we’re not really sure either. Our experience has been that DADA2 is exceedingly stable in its ASV choices, so this is kind of weird all around.

Thank you so much @ebolyen @thermokarst ! I think your guess is quite reasonable. I’ll run DADA2 in R and try 10 million reads (or more) to train the error model. However, I have something else to do right now, so I may not update this post for a week or longer. Sorry for the delay :sob:

Quick update, there appears to be a sample-order dependence with the consensus chimera algorithm that we’ve never caught before. I haven’t tracked it down yet, but I think this is probably a bug of some kind so thanks for the report.

In my testing so far it appears to affect ASVs present in very few samples, so it is not likely to have much effect on final output, but we will be investigating further. Since this is an issue in the dada2 code, I opened a corresponding issue in the dada2 GitHub repository, which is where progress will show up first: https://github.com/benjjneb/dada2/issues/806

Well maybe a false alarm, the bug was actually in my testing code: https://github.com/benjjneb/dada2/issues/806#issuecomment-511542586

Now that I’ve fixed that, I’m so far unable to reproduce any sample-order dependence. @Cotton would you be able to share the sequencing data that you are processing with the Q2 plugin with me, so I can try to reproduce the behavior you are seeing?

OK I think I finally understand what is going on here.

As @ebolyen mentioned, the one stochastic element of the dada2 pipeline is that it uses a subset of the total samples to learn the error model. This is what introduces a dependence on sample order, as the first X samples are picked for error model learning. This subsetting is just for computational speed reasons, and if you set the n-reads-learn parameter high enough to use the whole dataset then the sample order dependence should go away.
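As a toy illustration (this is not DADA2’s actual code), the order dependence boils down to taking whole samples, in order, until the read budget is reached — so relabeling (and thus reordering) the samples changes which reads train the model, and a budget that covers the whole dataset makes the order irrelevant:

```python
# Toy sketch of order-dependent subsetting for error-model learning.
# Sample names and read counts below are made up for illustration.

def reads_for_error_model(samples, n_reads_learn):
    """Take whole samples, in input order, until n_reads_learn reads are reached."""
    chosen, total = [], 0
    for name, n_reads in samples:
        if total >= n_reads_learn:
            break  # budget already met; later samples are never used
        chosen.append(name)
        total += n_reads
    return chosen

run1 = [("s1", 600_000), ("s2", 500_000), ("s3", 400_000)]
run2 = [("s3", 400_000), ("s1", 600_000), ("s2", 500_000)]  # same data, new order

print(reads_for_error_model(run1, 1_000_000))  # ['s1', 's2']
print(reads_for_error_model(run2, 1_000_000))  # ['s3', 's1']
# A budget covering the whole dataset removes the order dependence:
print(reads_for_error_model(run1, 2_000_000))  # ['s1', 's2', 's3']
```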

So, the different manifests led to (slightly) different error models, which led to (slightly) different denoising of the data. This didn’t show up in the denoised total read stats because, until recently, DADA2 would correct every read in the dataset, so denoising does not change the total number of reads. However, it did show up in the non-chimeric column, because one error model did a better job of distinguishing rare chimeric variants from the more abundant ASVs they were generated from. Because those rare chimeras were identified as their own ASVs, they were then removed by the chimera filtering, while in the other run they were corrected to the non-chimeric ASVs they were most similar to.

This happens because chimeras are often rare, and very close to a more abundant true ASV, which is the hardest situation to accurately discriminate. Thus modest changes in the error model can cause some of these chimeras to be identified or not, and you get the slight variation in total overall reads making it through the chimera removal stage.
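A stripped-down sketch of the bimera idea may make this concrete. This is not dada2’s actual isBimeraDenovo (which aligns parents, tolerates mismatches, and takes a consensus across samples); it only checks whether a read is an exact prefix of one parent glued to a suffix of another:

```python
# Toy bimera check: is `seq` exactly (prefix of parent A) + (suffix of parent B)?

def is_exact_bimera(seq, parents):
    for left in parents:
        for right in parents:
            if left == right:
                continue
            for i in range(1, len(seq)):
                if seq == left[:i] + right[i:]:
                    return True
    return False

parents = ["AAAAACCCCC", "GGGGGTTTTT"]
chimera = "AAAAATTTTT"  # left half of parent 1 + right half of parent 2

print(is_exact_bimera(chimera, parents))   # True  -> removed as a chimera
# If a slightly different error model instead "corrects" this rare read
# to parent 1, it never appears as its own ASV and is never flagged:
print(is_exact_bimera(parents[0], parents))  # False -> counted as non-chimeric
```

Either way the read survives trimming and denoising, so the “denoised” totals match between runs; only the “non-chimeric” totals diverge.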

Hope that makes sense! Overall I wouldn’t worry about this, though: it is really fiddling at the difficult-to-denoise “edges” of the data, and is unlikely to have much, if any, impact on total community signals.
