Hi all!
I'm running QIIME2 2019.1, installed via Anaconda on a CentOS 7 server, and I've run into a very weird problem.
I have some single-end, demultiplexed 16S amplicon sequencing data. I imported the data twice, using two different manifest.csv files (the two manifests differ only in the "sample-id" column; the "absolute-filepath" and "direction" columns are identical). I then ran the same DADA2 denoising command on each import (the command is shown below).
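Roughly, the workflow was as follows (the file names and trim/truncation values below are placeholders, not my exact settings):

```bash
# Hypothetical sketch only: file names and trim/trunc values are placeholders.
# Import single-end reads listed in a manifest, then denoise with DADA2.
qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path manifest-first.csv \
  --input-format SingleEndFastqManifestPhred33 \
  --output-path first-demux.qza

qiime dada2 denoise-single \
  --i-demultiplexed-seqs first-demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 250 \
  --o-table first-table.qza \
  --o-representative-sequences first-rep-seqs.qza \
  --o-denoising-stats first-stats-dada2.qza
```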
However, the number of features and the total frequency in the two table.qzv files are different. So I checked the stats-dada2.qzv files and found that the "input", "filtered", and "denoised" columns are the same, while the "non-chimeric" columns differ:
| non-chimeric-first | non-chimeric-second |
|---:|---:|
| 181663 | 180676 |
| 175891 | 174957 |
| 175152 | 174243 |
| 155349 | 154919 |
| 158265 | 157775 |
| 165684 | 165071 |
| 158502 | 158241 |
| 132739 | 132301 |
| 141902 | 141458 |
| 148559 | 147940 |
| 139161 | 138206 |
| 127764 | 126603 |
| 141785 | 140971 |
| 137757 | 137437 |
I re-ran the same command shown above on these two demux.qza files, and the results are still different. What could be causing this inconsistency?
Hi @Cotton, I can't do anything with the partial manifests --- can you send the whole ones? Also, can you send the QZAs for the DADA2 summaries, instead of the TSVs? This will allow us to interrogate the provenance, too.
Thanks so much @Cotton! To be honest, I have no clue what is going on here --- time to call on @benjjneb! As best I can tell, no samples were missed or duplicated when you relabeled the Sample IDs. It looks like everything is identical between the two runs with the exception of the actual Sample IDs used. As far as I can tell, this shouldn't have any bearing on the consensus protocol, but, maybe I am confused. @benjjneb, any clue why the exact same dataset, but with new IDs would produce different results? Thanks!
Thank you @thermokarst. I have searched Google hoping to find someone else with the same problem, but… no one has.
I thought maybe something went wrong when importing the data into QIIME2, so I uncompressed the two demux.qza files and checked the MD5 checksums of the gunzipped original files. Sadly, they are identical… so this problem is still a mystery.
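For reference, the check went roughly like this (file names are placeholders); a .qza is just a zip archive, so the per-sample fastq.gz files can be extracted and compared directly:

```bash
# Hypothetical sketch of the integrity check described above (placeholder file names).
unzip -q first-demux.qza  -d first-demux
unzip -q second-demux.qza -d second-demux

# MD5 of the decompressed reads from each import, sorted so the two lists can be diffed
find first-demux  -name "*.fastq.gz" -exec sh -c 'gunzip -c "$1" | md5sum' _ {} \; | sort > first.md5
find second-demux -name "*.fastq.gz" -exec sh -c 'gunzip -c "$1" | md5sum' _ {} \; | sort > second.md5
diff first.md5 second.md5   # no output means the underlying reads are identical
```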
Thank you, and enjoy your vacation @benjjneb! This is just an exercise, so I'm not in a hurry. I'll keep looking for differences between these two runs, and once I find the reason for this inconsistency I'll update this post.
A possible reason is that DADA2 selects different reads to train the error model (1,000,000 reads by default). You can add --verbose to the DADA2 command to see which samples are selected.
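For example (the same denoise command, just with the flag appended; file names and parameters are placeholders):

```bash
# Hypothetical sketch: re-run the denoise step with --verbose so the console log shows
# the reads/samples DADA2 loaded for error-model training.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs first-demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 250 \
  --o-table first-table-verbose.qza \
  --o-representative-sequences first-rep-seqs-verbose.qza \
  --o-denoising-stats first-stats-verbose.qza \
  --verbose
```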
Hi @nmgduan! Thanks for your reply.
DADA2 does select a subset of reads to train the error model. What confuses me is that if the selection were random, then every run of DADA2 on the same input should give a different output. However, I re-ran DADA2 and the result is identical to the previous run that used the same demux.qza file as input (while the two different demux.qza files give different outputs). first-table-run1.qzv (742.7 KB) first-table-run2.qzv (742.7 KB)
@thermokarst and I were talking about this briefly the other day. We're stumped as well, but it may be the case that the error model is still too unstable at 1 million reads. This is a situation where the diagnostic plots that DADA2 produces could be really useful.
So a different file ordering (from the new names) results in a different sampling of reads for the error model, which results in different ASVs (but, since trimming is the same, we get the same number of reads at the denoising step), and these different ASVs are then treated differently by the chimera checking.
If you ran DADA2 in R, there would be a point where you could check the error model; it's possible you just need more reads to get it to converge correctly. (You could try 10 million reads, for example, and see if that makes the two artifacts more concordant.)
This was just our best guess though; we're not really sure either. Our experience has been that DADA2 is exceedingly stable in its ASV choices, so this is kind of weird all around.
Thank you so much @ebolyen and @thermokarst! I think your guess is quite reasonable. I'll run DADA2 in R and try 10 million reads (and maybe more) to train the error model. However, I have something else to do right now, so I may not update this post for a week or longer. Sorry for the delay.
Quick update: there appears to be a sample-order dependence in the consensus chimera algorithm that we've never caught before. I haven't tracked it down yet, but I think this is probably a bug of some kind, so thanks for the report.
Now that I've fixed that, I'm so far unable to reproduce any sample-order dependence. @Cotton would you be able to share the sequencing data that you are processing with the Q2 plugin with me, so I can try to reproduce the behavior you are seeing?
OK I think I finally understand what is going on here.
As @ebolyen mentioned, the one stochastic element of the dada2 pipeline is that it uses a subset of the total samples to learn the error model. This is what introduces a dependence on sample order, as the first X samples are picked for error model learning. This subsetting is just for computational speed reasons, and if you set the n-reads-learn parameter high enough to use the whole dataset then the sample order dependence should go away.
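From the QIIME2 plugin, that would look roughly like the following (the read count and file names are placeholders; pick a value at least as large as the total number of reads in your dataset):

```bash
# Hypothetical sketch: raise n-reads-learn so error-model training covers (nearly)
# the whole dataset, removing the dependence on which samples come first.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs first-demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 250 \
  --p-n-reads-learn 10000000 \
  --o-table first-table.qza \
  --o-representative-sequences first-rep-seqs.qza \
  --o-denoising-stats first-stats-dada2.qza
```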
So, the different manifests led to (slightly) different error models, which led to (slightly) different denoising of the data. This didn't show up in the denoised total read stats because, until recently, DADA2 would correct every read in the dataset, hence denoising does not change the total number of reads. However, it did show up in the non-chimeric column, because one error model did a better job of distinguishing rare chimeric variants from the more abundant ASVs they were generated from. Where those rare chimeras were identified as their own ASVs, they were then removed by the chimera filtering, while in the other run they were corrected to the non-chimeric ASVs they were most similar to.
This happens because chimeras are often rare, and very close to a more abundant true ASV, which is the hardest situation to accurately discriminate. Thus modest changes in the error model can cause some of these chimeras to be identified or not, and you get the slight variation in total overall reads making it through the chimera removal stage.
Hope that makes sense! Overall, I wouldn't worry about this though; this is really fiddling at the difficult-to-denoise "edges" of the data, and is unlikely to have much, if any, impact on total community signals.