Hi folks,
My name's Jesse, and I'm working in Jed Fuhrman's lab at USC. My colleague Yi-Chun (@LivYeh) and I have been developing a protocol for qiime2 for analyzing the mixed 16s/18s amplicons that are produced by the 515Y/926R primer pair. In case anyone's interested, we've got a basic set of wrapper scripts for qiime2 and an associated protocol up online. This protocol includes a step that splits mixed amplicons into 18s and 16s pools using bbsplit.
In any event, we have been testing the performance of deblur/DADA2 on our samples which include 16s and/or 18s mock communities (a recent paper shows a good example of why these are useful for sequencing QC). Because we know the exact sequences of these mocks, we can test the performance of the two different algorithms. In general, both resolve our mocks exactly, and we have found that DADA2 seems to do so while retaining a lot more of the data (which makes sense since it corrects reads, instead of discarding as Deblur does). So, we're moving towards using DADA2 because it helps us keep more of the small but precious Eukaryotic fraction of our amplicons.
However, we have noticed that DADA2 sometimes produces what we think are artefacts. This doesn't happen very often, and seems to be mostly restricted to noisy 2x300bp data. What we're calling "artifacts" are ASVs with a single mismatch to the mock sequence (tested by BLASTing across the entire length). We don't think that these ASVs can be explained by sample bleedthrough because the rows that they are in are completely empty elsewhere in the OTU table, suggesting they are derived from the "parent" mock sequences. The strange thing is that it doesn't happen all the time - some runs are good.
Things we've tried that were unable to resolve this issue:
-Increasing the number of sequences going into the error model to the maximum number
-Joining the sequences prior to denoising (with VSEARCH)
-Reducing the trim lengths to the minimum 20bp overlap
-Running the same sample on a more recent version of DADA2 in R (1.9.2)
There was one situation where we could actually figure out what was going on. It was a run where we had two extra mock samples from another lane of the HiSeq instrument. So mocks from lane 1 were sequenced at depth x, and then these two extra mocks were sequenced at twice the depth (I believe approximately 100k and 200k, respectively). When analyzing both lanes together, we observed artifacts appearing in the higher-depth mocks. So, I just downsampled those mocks to the approximately same level as everything else and the putative artifacts disappeared.
In comparison, we've never observed deblur to make this kind of ASV call - it always captures the mocks exactly and any 1-mismatches to the mocks are probably derived from real seqs that are present in our actual environmental samples. These potentially spurious ASVs are a concern because they can be abundant (~3-5%) of sequences in some affected samples, and suggest maybe DADA2 can over-split on occasion.
Can Ben (@benjjneb) or anyone else offer advice on any other directions we can pursue to try and resolve these potential artifacts? I think the DADA2 version of q2-dada2 is only 1.6.0 in qiime2-2018.8 (the version we're currently using; I checked by going into the conda env and opening R) but based on our preliminary tests it doesn't seem to make a difference if we use a newer version. Or is there something that has changed in the latest versions that could solve this? The fact that we get similar results from the R version and qiime2 suggests to us that it might be something inherent to the algorithm/dataset and probably not qiime2 related.
Thanks a lot in advance for any help you can offer!
Jesse McNichol