Does demux apply any quality threshold to the index reads?

demux

(Robin R Rohwer) #1

Does demux consider index quality?

I am using demux to demultiplex paired-end EMP reads, and it is working fine:

$ qiime demux emp-paired --i-seqs p1.qza --m-barcodes-file p1_mapping.txt --m-barcodes-column BarcodeSequence --output-dir p1_demux

My question is about understanding the demux method. Specifically:

  • Does demux require an exact match?
  • Does demux apply any quality filter to the index.fastq file before deciding if it’s a match?

I thought there might be a paper that describes the demux algorithm in more depth, but no luck:

$ qiime demux --citations
No citations found.

The reason I want to understand this is that applying a quality filter to the index sequences (the barcodes) has been shown to be an effective way to reduce cross-talk errors (when reads are misattributed to the wrong sample). This is more likely to occur when reads are single-indexed (as in the EMP protocol or whenever you have only 1 index.fastq file), and when samples are highly multiplexed. (citation: https://doi.org/10.1186/s12864-016-3217-x)

Because demux is starting with a fastq file, it must convert that to a simple fasta sequence before matching it to the sequences in your mapping file. So how did demux interpret the low quality bases in the index fastq?


Set-up Details:

  • I am using qiime2-2019.1 in a conda environment on mac
  • Exact commands in code blocks above, no errors

(Colin Brislawn) #2

Hello @rrohwer!

Same!

The command qiime demux must be one of the most used Q2 plugins. They have got to write a small paper, if only to get those sweet-sweet citations!


I’m not one of the Qiime devs and I’m not deeply familiar with the q2-demux source code, so I’ll let the real qiime devs answer the specifics. I really like your question and I am also interested in the Illumina cross contamination problem.

The barcodes used for demultiplexing always (almost always?) use Golay coding, which is designed to tolerate errors. Basically, each barcode is 2 bp different from every other barcode, so even if there is one error and it’s not an exact match, it should be possible to recover the correct, true barcode.
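To make the distance argument concrete, here is a minimal Python sketch. It is not taken from q2-demux; the function names and the toy 6 bp barcodes are made up for illustration. It measures the minimum pairwise Hamming distance of a barcode set and looks up the closest known barcode(s) for an observed index read. Note that a single error can only be corrected unambiguously when exactly one barcode is closest, which in general needs a minimum distance of at least 3.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def nearest_barcodes(observed, barcodes):
    """Return (distance, [closest barcodes]); more than one entry means a tie."""
    dists = {bc: hamming(observed, bc) for bc in barcodes}
    best = min(dists.values())
    return best, sorted(bc for bc, d in dists.items() if d == best)


# Toy barcodes (hypothetical, not real EMP/Golay barcodes).
barcodes = ["AAAAAA", "CCCAAA", "AAACCC"]
set_distance = min_pairwise_distance(barcodes)
# An index read with one sequencing error in the last base.
distance, candidates = nearest_barcodes("AAAAAT", barcodes)
```

If `candidates` holds exactly one barcode, the read could be rescued; if it holds several, the read is ambiguous and would have to be dropped.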

Does qiime demux emp-paired perform this Golay error correction or does it skip it for speed? :man_shrugging: Looks like this plugin was largely written by @gregcaporaso, so this is a good excuse to @ the lead developer. He can actually answer your question. Hi Greg!


Thank you for sharing that article! I was super surprised that they found quality filtering barcodes was effective in reducing the out-of-bag error rate for barcoding. Maybe it’s the HiSeq 2500 they tested pre-2016 and the modern Illumina MiSeq performs differently?

Compared to barcode quality filtering, I think dual indexing is a much better way to measure and reduce cross-contamination, because of the way cross-contamination happens on the Illumina machines. Based on my understanding, which might be wrong/limited, the majority of barcode mix-ups happen because two amplicons are annealed very closely on the Illumina flow cell. When their barcode is read, it’s hard to tell which amplicon has which barcode. Reading two distinctive barcodes gives us a second chance to get the right answer, or to find and ignore ambiguous barcodes.

Because the issue is on the flow cell itself, I feel like we would have the best shot at finding and solving it by looking at the raw image of the flow cell. But as far as I know, most current demultiplexing methods start with Illumina-made fastq files.

Am I on the right track here? What’s your understanding of the source of Illumina cross-contamination?


Experimentally identifying cross-contamination should be easy; simply sequence an axenic positive control on every run, and then add up how many other reads end up with the barcode of your positive control and how often your positive control ends up in other samples. My team has 3+ years of axenic positive controls just like this. We have them for sanity checking, but have yet to build a benchmark and write a paper.
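The bookkeeping for that positive-control check could look something like the sketch below. This is only an illustration under assumptions: the nested `counts` dictionary (sample -> taxon -> read count), the sample and taxon names, and the numbers are all invented, and it assumes the control organism is genuinely absent from the real samples.

```python
def crosstalk_rates(counts, control_sample, control_taxon):
    """Estimate cross-talk from an axenic positive control.

    counts: dict mapping sample name -> dict of taxon -> read count.
    Returns the fraction of control-sample reads that are NOT the control
    taxon ("into_control") and the fraction of all control-taxon reads
    that turned up in other samples ("out_of_control").
    """
    control = counts[control_sample]
    control_total = sum(control.values())
    control_own = control.get(control_taxon, 0)
    foreign_in_control = control_total - control_own
    taxon_elsewhere = sum(
        taxa.get(control_taxon, 0)
        for sample, taxa in counts.items()
        if sample != control_sample
    )
    taxon_total = control_own + taxon_elsewhere
    return {
        "into_control": foreign_in_control / control_total if control_total else 0.0,
        "out_of_control": taxon_elsewhere / taxon_total if taxon_total else 0.0,
    }


# Invented example counts for one run.
counts = {
    "pos_ctrl": {"E_coli": 990, "Bacteroides": 10},
    "s1": {"Bacteroides": 500, "E_coli": 5},
    "s2": {"Bacteroides": 495, "E_coli": 5},
}
rates = crosstalk_rates(counts, "pos_ctrl", "E_coli")
```

Tracking both numbers per run would show whether cross-talk is stable or varies with run and multiplexing level.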

Thanks for starting this conversation. I look forward to others qiime-ing in. :qiime2:

Colin

P.S. Here’s some more cross-contamination links:
From the solo-scientist Robert Edgar: https://www.drive5.com/usearch/manual/uncross_algo.html
From the Joint Genome Institute (JGI at Berkeley): http://seqanswers.com/forums/showthread.php?t=73736
From Illumina itself! https://www.illumina.com/content/dam/illumina-marketing/documents/products/whitepapers/index-hopping-white-paper-770-2017-004.pdf


(Nicholas Bokulich) #3

Hi @rrohwer and @colinbrislawn!

Yes. Currently it does not support Golay barcodes or mismatches, but error correction would be a great feature to add at some point (we have an open issue).

No

Thanks for including that paper! We had discussed that paper, and some others, in this discussion:

So yes, I have been considering implementing that method at some point but have no ETA.

As far as I know it does not, and this is a very good point: low-quality base scores could lead to mis-mapped reads. There are 2 things to consider regarding this: 1) quality is usually good in the barcode reads (there are exceptions!), and it does not drop off in the same way that it does for very long reads; 2) barcoding schemes usually require 2+ differences between each pair of barcodes, so a single error will by default cause that read to be dropped. It will require 2 errors for a barcode to incorrectly map, and 2 errors in that short read will be very rare.
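For illustration only, here is a rough Python sketch of the behaviour described above: exact matching that ignores the quality line, plus what an optional barcode quality filter might look like. This is not the q2-demux source code. The function names, the `min_quality` parameter, and the example barcode are hypothetical, and Phred+33 quality encoding is assumed.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def assign_sample(barcode_seq, barcode_qual, barcode_to_sample, min_quality=None):
    """Return the sample for an index read, or None if it is unassigned.

    With min_quality=None the quality line is ignored and only an exact
    sequence match counts. With a threshold, any index read containing a
    base below that Phred score is discarded before matching.
    """
    if min_quality is not None and min(phred_scores(barcode_qual)) < min_quality:
        return None
    return barcode_to_sample.get(barcode_seq)


# Hypothetical mapping-file content: barcode -> sample ID.
barcode_to_sample = {"ACGTAC": "sampleA"}

# The last base has quality '#' (Phred 2); the others are 'I' (Phred 40).
unfiltered = assign_sample("ACGTAC", "IIIII#", barcode_to_sample)
filtered = assign_sample("ACGTAC", "IIIII#", barcode_to_sample, min_quality=20)
mismatch = assign_sample("ACGTAA", "IIIIII", barcode_to_sample)
```

Here the unfiltered call assigns the read to `sampleA` despite the low-quality base, the filtered call drops it, and the one-mismatch read is dropped either way because no error correction is applied.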

But I will let @gregcaporaso chime in if he has anything to add regarding the philosophy behind the design of demux.