qiime demux must be one of the most used Q2 plugins. They have got to write a small paper, if only to get those sweet-sweet citations!
I’m not one of the Qiime devs and I’m not deeply familiar with the q2-demux source code, so I’ll let the real qiime devs answer the specifics. I really like your question and I am also interested in the Illumina cross contamination problem.
The barcodes used for demultiplexing always (almost always?) use Golay coding, which is designed to tolerate errors. Basically, each barcode differs from every other barcode at several positions (at least 4 bp for the 12 bp Golay barcodes, if I have the math right), so even if there is one error and it’s not an exact match, it should be possible to recover the correct, true barcode.
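To make the idea concrete, here is a minimal sketch of error-tolerant barcode matching. To be clear, this is plain nearest-barcode matching by Hamming distance, not a real Golay decoder, and I am not claiming it is what q2-demux does. The barcodes and the `max_errors` parameter are made up for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed, known_barcodes, max_errors=1):
    """Return the single known barcode within max_errors of observed, else None."""
    hits = [bc for bc in known_barcodes if hamming(observed, bc) <= max_errors]
    # Only accept an unambiguous match; zero or several hits means "give up".
    return hits[0] if len(hits) == 1 else None


# Made-up 8 bp barcodes (assumption), every pair differing at 4 or more positions.
barcodes = ["AAAAAAAA", "AAAACCCC", "CCCCAAAA", "CCCCCCCC"]

print(correct_barcode("AAAAAAAT", barcodes))  # one error: recovered as AAAAAAAA
print(correct_barcode("AACCAACC", barcodes))  # too far from everything: None
```

Because the barcodes are well separated, a read with a single error is still closer to its true barcode than to any other, which is the whole point of an error-correcting design.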
Does qiime demux emp-paired perform this Golay error correction, or does it skip it for speed? It looks like this plugin was largely written by @gregcaporaso, so this is a good excuse to @ the lead developer. He can actually answer your question. Hi Greg!
Thank you for sharing that article! I was super surprised that they found quality filtering barcodes was effective in reducing the out-of-bag error rate for barcoding. Maybe it’s the HiSeq 2500 they tested pre-2016 and the modern Illumina MiSeq performs differently?
Compared to barcode quality filtering, I think dual indexing is a much better way to measure and reduce cross-contamination, because of the way cross-contamination happens on the Illumina machines. Based on my understanding, which might be wrong/limited, the majority of barcode mix-ups happen because two amplicons are annealed very closely on the Illumina flow cell. When their barcodes are read, it’s hard to tell which amplicon has which barcode. Reading two distinctive barcodes gives us a second chance to get the right answer, or to find and ignore ambiguous barcodes.
Because the issue is on the flow cell itself, I feel like we would have the best shot at finding and solving it by looking at the raw image of the flow cell. But as far as I know, most current demultiplexing methods start with Illumina-made fastq files.
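Here is a rough sketch of what I mean by "find and ignore ambiguous barcodes" with dual indexing. It assumes unique dual indexes, meaning each i7 and each i5 index is used for exactly one sample, so any other combination must be a mix-up. The sample sheet, index sequences, and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical sample sheet: (i7 index, i5 index) -> sample name.
SAMPLE_SHEET = {
    ("ACGTACGT", "TTGCAAGC"): "sample_A",
    ("GGATCCAA", "CATGGTCA"): "sample_B",
}


def assign_sample(i7, i5, sheet=SAMPLE_SHEET):
    """Return the sample for an expected index pair, or None for any other pair."""
    return sheet.get((i7, i5))


def tally(reads, sheet=SAMPLE_SHEET):
    """Count reads per sample, plus reads whose index pair was never loaded."""
    counts = Counter()
    for i7, i5 in reads:
        counts[assign_sample(i7, i5, sheet) or "unexpected_pair"] += 1
    return counts


reads = [
    ("ACGTACGT", "TTGCAAGC"),  # sample_A
    ("GGATCCAA", "CATGGTCA"),  # sample_B
    ("ACGTACGT", "CATGGTCA"),  # i7 from A with i5 from B: a mix-up
    ("ACGTACGT", "TTGCAAGC"),  # sample_A
]
print(tally(reads))
```

The fraction of reads landing in `unexpected_pair` gives a rough estimate of how often indexes get mixed up on a run, and those reads can simply be discarded instead of being assigned to the wrong sample.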
Am I on the right track here? What’s your understanding of the source of Illumina cross-contamination?
Experimentally identifying cross-contamination should be easy; simply sequence an axenic positive control on every run, and then add up how many other reads end up with the barcode of your positive control and how often your positive control ends up in other samples. My team has 3+ years of axenic positive controls just like this. We have them for sanity checking, but have yet to build a benchmark and write a paper.
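The bookkeeping for that positive-control check could look something like the sketch below. It assumes a per-sample table of read counts by taxon, and that the control organism does not genuinely occur in the real samples. The sample names, taxa, and counts are invented for illustration, not real data.

```python
def control_cross_talk(table, control_sample, control_taxon):
    """Estimate cross-talk into and out of an axenic positive control.

    table maps sample name -> {taxon: read count}.
    """
    control_counts = table[control_sample]
    control_total = sum(control_counts.values())
    # Reads in the control that are not the control organism.
    foreign_in_control = control_total - control_counts.get(control_taxon, 0)

    other_total = 0
    control_in_others = 0
    for sample, counts in table.items():
        if sample == control_sample:
            continue
        other_total += sum(counts.values())
        # Control-organism reads that ended up in a non-control sample.
        control_in_others += counts.get(control_taxon, 0)

    return {
        "into_control": foreign_in_control / control_total if control_total else 0.0,
        "out_of_control": control_in_others / other_total if other_total else 0.0,
    }


# Invented example counts (assumption).
table = {
    "pos_ctrl": {"E_coli": 9990, "Bacteroides": 10},
    "gut_1": {"Bacteroides": 4990, "E_coli": 10},
    "gut_2": {"Bacteroides": 5000},
}
print(control_cross_talk(table, "pos_ctrl", "E_coli"))
```

Tracking both fractions run after run would be the start of the kind of benchmark I have in mind.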
Thanks for starting this conversation. I look forward to others qiime-ing in.
P.S. Here’s some more cross-contamination links:
From the solo-scientist Robert Edgar: https://www.drive5.com/usearch/manual/uncross_algo.html
From the Joint Genome Institute (JGI at Berkeley): http://seqanswers.com/forums/showthread.php?t=73736
From Illumina itself! https://www.illumina.com/content/dam/illumina-marketing/documents/products/whitepapers/index-hopping-white-paper-770-2017-004.pdf