Hi all!
I'm running QIIME2 2019.1, installed via Anaconda on a CentOS 7 server, and I've run into a very weird problem.
I have some single-end, demultiplexed 16S amplicon sequencing data. I imported the data twice, using two different manifest.csv files (the two manifests differ only in the "sample-id" column; the "absolute-filepath" and "direction" columns are identical). I then ran the same DADA2 denoising command on each import (the command is shown below).
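Roughly, the workflow was as follows (the file names and trim/truncation values below are placeholders, not my exact settings):

```bash
# Hypothetical sketch only: file names and trim/trunc values are placeholders.
# Import single-end reads listed in a manifest, then denoise with DADA2.
qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path manifest-first.csv \
  --input-format SingleEndFastqManifestPhred33 \
  --output-path first-demux.qza

qiime dada2 denoise-single \
  --i-demultiplexed-seqs first-demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 250 \
  --o-table first-table.qza \
  --o-representative-sequences first-rep-seqs.qza \
  --o-denoising-stats first-stats-dada2.qza
```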
However, the number of features and the total frequency in the two table.qzv files are different. So I checked the stats-dada2.qzv files and found that the "input", "filtered", and "denoised" columns are the same, while the "non-chimeric" columns differ:
| non-chimeric-first | non-chimeric-second |
|---:|---:|
| 181663 | 180676 |
| 175891 | 174957 |
| 175152 | 174243 |
| 155349 | 154919 |
| 158265 | 157775 |
| 165684 | 165071 |
| 158502 | 158241 |
| 132739 | 132301 |
| 141902 | 141458 |
| 148559 | 147940 |
| 139161 | 138206 |
| 127764 | 126603 |
| 141785 | 140971 |
| 137757 | 137437 |
I re-ran the same command shown above on these two demux.qza files, and the results are still different. What could be causing this inconsistency?
Hi @Cotton, I can't do anything with the partial manifests --- can you send the whole ones? Also, can you send the QZAs for the DADA2 summaries, instead of the TSVs? This will allow us to interrogate the provenance, too.
Thanks so much @Cotton! To be honest, I have no clue what is going on here --- time to call on @benjjneb! As best I can tell, no samples were missed or duplicated when you relabeled the Sample IDs. It looks like everything is identical between the two runs with the exception of the actual Sample IDs used. As far as I can tell, this shouldn't have any bearing on the consensus protocol, but, maybe I am confused. @benjjneb, any clue why the exact same dataset, but with new IDs would produce different results? Thanks!
Thank you @thermokarst. I have searched Google hoping to find someone else with the same problem, but… no one has.
I thought maybe something went wrong when importing the data into QIIME2, so I uncompressed the two demux.qza files and checked the MD5 checksums of the gunzipped original files. Sadly, they are identical… so this problem is still a mystery.
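For reference, the check went roughly like this (file names are placeholders); a .qza is just a zip archive, so the per-sample fastq.gz files can be extracted and compared directly:

```bash
# Hypothetical sketch of the integrity check described above (placeholder file names).
unzip -q first-demux.qza  -d first-demux
unzip -q second-demux.qza -d second-demux

# MD5 of the decompressed reads from each import, sorted so the two lists can be diffed
find first-demux  -name "*.fastq.gz" -exec sh -c 'gunzip -c "$1" | md5sum' _ {} \; | sort > first.md5
find second-demux -name "*.fastq.gz" -exec sh -c 'gunzip -c "$1" | md5sum' _ {} \; | sort > second.md5
diff first.md5 second.md5   # no output means the underlying reads are identical
```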
Thank you, and enjoy your vacation @benjjneb! This is just an exercise, so I'm not in a hurry. I'll keep looking for differences between these two runs, and once I find the reason for this inconsistency I'll update this post.
A possible reason is that DADA2 selects different reads to train the error model (1,000,000 reads by default). You can add --verbose to the DADA2 command to see which samples are selected.
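For example (the same denoise command, just with the flag appended; file names and parameters are placeholders):

```bash
# Hypothetical sketch: re-run the denoise step with --verbose so the console log shows
# the reads/samples DADA2 loaded for error-model training.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs first-demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 250 \
  --o-table first-table-verbose.qza \
  --o-representative-sequences first-rep-seqs-verbose.qza \
  --o-denoising-stats first-stats-verbose.qza \
  --verbose
```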
Hi @nmgduan! Thanks for your reply.
DADA2 does select a subset of reads to train the error model. What confuses me is that if the selection were random, then every run of DADA2 on the same input should give a different output. However, I re-ran DADA2 and the result is identical to the previous run that used the same demux.qza file as input (while the two different demux.qza files give different outputs). first-table-run1.qzv (742.7 KB) first-table-run2.qzv (742.7 KB)
@thermokarst and I were talking about this briefly the other day. We're stumped as well, but it may be the case that the error model is still too unstable at 1 million reads. This is a situation where the diagnostic plots that DADA2 produces could be really useful.
So a different file ordering (from the new names) results in a different sampling of reads for the error model, which results in different ASVs (but, since trimming is the same, we get the same number of reads at the denoising step), and these different ASVs are then treated differently by the chimera checking.
If you ran DADA2 in R, there would be a point where you could check the error model; it's possible you just need more reads to get it to converge correctly. (You could try 10 million reads, for example, and see if that makes the two artifacts more concordant.)
This was just our best guess though; we're not really sure either. Our experience has been that DADA2 is exceedingly stable in its ASV choices, so this is kind of weird all around.
Thank you so much @ebolyen and @thermokarst! I think your guess is quite reasonable. I'll run DADA2 in R and try 10 million reads (and maybe more) to train the error model. However, I have something else to do right now, so I may not update this post for a week or longer. Sorry for the delay.
Quick update: there appears to be a sample-order dependence in the consensus chimera algorithm that we've never caught before. I haven't tracked it down yet, but I think this is probably a bug of some kind, so thanks for the report.
Now that I've fixed that, I'm so far unable to reproduce any sample-order dependence. @Cotton would you be able to share the sequencing data that you are processing with the Q2 plugin with me, so I can try to reproduce the behavior you are seeing?
OK I think I finally understand what is going on here.
As @ebolyen mentioned, the one stochastic element of the dada2 pipeline is that it uses a subset of the total samples to learn the error model. This is what introduces a dependence on sample order, as the first X samples are picked for error model learning. This subsetting is just for computational speed reasons, and if you set the n-reads-learn parameter high enough to use the whole dataset then the sample order dependence should go away.
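From the QIIME2 plugin, that would look roughly like the following (the read count and file names are placeholders; pick a value at least as large as the total number of reads in your dataset):

```bash
# Hypothetical sketch: raise n-reads-learn so error-model training covers (nearly)
# the whole dataset, removing the dependence on which samples come first.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs first-demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 250 \
  --p-n-reads-learn 10000000 \
  --o-table first-table.qza \
  --o-representative-sequences first-rep-seqs.qza \
  --o-denoising-stats first-stats-dada2.qza
```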
So, the different manifests led to (slightly) different error models, which led to (slightly) different denoising of the data. This didn't show up in the denoised total read stats because, until recently, DADA2 would correct every read in the dataset, hence denoising does not change the total number of reads. However, it did show up in the non-chimeric column, because one error model did a better job of distinguishing rare chimeric variants from the more abundant ASVs they were generated from. Where those rare chimeras were identified as their own ASVs, they were then removed by the chimera filtering, while in the other run they were corrected to the non-chimeric ASVs they were most similar to.
This happens because chimeras are often rare, and very close to a more abundant true ASV, which is the hardest situation to accurately discriminate. Thus modest changes in the error model can cause some of these chimeras to be identified or not, and you get the slight variation in total overall reads making it through the chimera removal stage.
Hope that makes sense! Overall, I wouldn't worry about this though; this is really fiddling at the difficult-to-denoise "edges" of the data, and is unlikely to have much, if any, impact on total community signals.