Handling mismatches in barcode when demux

wangj50 · September 17, 2020, 7:42pm

Hello QIIME2 team,

The issue of mismatches in barcodes comes back to me from time to time during demux. Currently, it seems QIIME can only handle barcodes that (1) have a uniform length across samples and (2) match exactly.

The reason this is occasionally an issue is that some sequencing centers provide a barcode sequencing file that may contain barcode sequences shorter than the expected length. E.g. we used the EMP protocol to prepare all our libraries (which designs the barcode to be 12 bp in length), but at the new sequencing center we have been using, about 10-15% of the barcode reads are shorter than 12 bp, with the majority being 11 bp. Because of this, many samples in this run did not have any sequences after demultiplexing (why the barcodes for those empty samples are always shorter than 12 bp could be another question to discuss). When I use fastx-toolkit's barcode splitter and allow for some mismatches (--mismatch 3 --partial 2), I find that the previously empty samples do have a significant number of sequences.

Since the fastx-toolkit only generates sample-specific barcode sequences, I have to go back to the original sequencing files, use the headers of the barcode sequences to extract the per-sample sequences, and then import them into QIIME 2 again. It is very slow because, for each sample, I am searching the entire raw sequencing file for headers that match the headers of that sample's barcode sequences.
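One way the per-sample search described above could be sped up is to index all barcode-read headers once and then stream the raw file a single time, instead of scanning it once per sample. The following is only a minimal sketch under assumptions: the function names are hypothetical, and it assumes headers look like `@id/1 description`, so that the read ID is shared between the barcode read and the raw read once any `/1` or `/2` suffix is dropped.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return
        yield tuple(record)


def read_id(header):
    """Reduce a header to its read ID: drop '@', the description, and a /1 or /2 suffix.

    Assumption: this header layout matches your sequencing center's output.
    """
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def assign_reads(barcode_hits, raw_records):
    """Assign raw reads to samples in a single pass over the raw file.

    barcode_hits: dict mapping sample ID -> iterable of barcode-read headers
                  (e.g. collected from a barcode splitter's per-sample output)
    raw_records:  iterable of FASTQ records from the multiplexed read file
    """
    # Build one lookup table covering every sample.
    id_to_sample = {}
    for sample, headers in barcode_hits.items():
        for header in headers:
            id_to_sample[read_id(header)] = sample

    # Stream the raw reads once; each lookup is a constant-time dict access.
    per_sample = {}
    for record in raw_records:
        sample = id_to_sample.get(read_id(record[0]))
        if sample is not None:
            per_sample.setdefault(sample, []).append(record)
    return per_sample
```

For real data, `raw_records` would come from something like `read_fastq(gzip.open(path, "rt"))`, and the per-sample records would be written straight to files rather than held in memory.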

As a result, I was thinking about how QIIME 2 could be improved to handle this situation. After reading through the source code of the demux plugin, I think the easiest way is to allow users to provide a barcode map in which each barcode maps to a single sample, but multiple barcodes may map to the same sample (so that mismatches are allowed). This way, the overall structure of the demux plugin can be maintained, and users can generate this barcode map with other tools and use it in QIIME.
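To make the proposed barcode map concrete, here is a rough sketch of how such a map might be generated outside QIIME 2. It is not part of q2-demux; the function names and the choice of "one substitution or one deleted base" as the tolerance are assumptions made purely for illustration. Variants that could belong to more than one sample are dropped so that every barcode still maps to a single sample.

```python
def one_base_variants(barcode, alphabet="ACGT"):
    """Return the barcode plus every single-base substitution and single-base deletion."""
    variants = {barcode}
    for i in range(len(barcode)):
        variants.add(barcode[:i] + barcode[i + 1:])  # delete base i
        for base in alphabet:
            variants.add(barcode[:i] + base + barcode[i + 1:])  # substitute base i
    return variants


def build_barcode_map(sample_barcodes):
    """Build a dict mapping observed barcode -> sample ID.

    sample_barcodes: dict mapping sample ID -> expected barcode (assumed unique).
    Several barcodes may map to the same sample, but a variant reachable from
    more than one sample is ambiguous and is left out of the map.
    """
    candidates = {}
    for sample, barcode in sample_barcodes.items():
        for variant in one_base_variants(barcode):
            candidates.setdefault(variant, set()).add(sample)

    barcode_map = {
        variant: next(iter(samples))
        for variant, samples in candidates.items()
        if len(samples) == 1
    }
    # An exact barcode always maps to its own sample, even if it is also a
    # one-base variant of another sample's barcode.
    for sample, barcode in sample_barcodes.items():
        barcode_map[barcode] = sample
    return barcode_map
```

With 12 bp barcodes this yields a few dozen entries per sample, so the resulting map stays small enough to use as a plain lookup table.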

Thanks!

Mehrbod_Estaki · September 18, 2020, 8:11am

Hi @wangj50,
I believe you have been focusing on the q2-demux plugin, which is really streamlined for the EMP protocol and so does have certain limitations, as you pointed out. But QIIME 2 currently also has a cutadapt plugin, which is another demultiplexing plugin (e.g. cutadapt demux-paired) that allows much more flexibility, including error tolerance with the --p-error-rate parameter, and you can provide variable-length barcodes in your mapping files. Does this get you where you need to be?

wangj50 · September 18, 2020, 1:45pm

@Mehrbod_Estaki Thanks for replying. I just checked the cutadapt demux-paired plugin, but it requires the barcodes to be in the sequences themselves. Am I right?

For the sequencing data generated using EMP protocol, typically we would get a forward read (multiplexed), a reverse read, and an index read. The index is not part of the forward or reverse read sequences, nor in the header of the sequencing read.

Mehrbod_Estaki · September 18, 2020, 11:05pm

Hi @wangj50,
Ah, yes, you're right, the q2-cutadapt plugin can't accept barcodes in a separate file. I vaguely remember that the stand-alone cutadapt could handle this case, though don't quote me on this. You'd have to take a look through their docs to be sure. If they do have an option, then the easiest route may be to just use the standalone tool (which comes preinstalled with QIIME 2) for that step, or to add that functionality to the existing q2-cutadapt plugin. That being said, I'm not sure how high this would go on the developers' priority list at the moment. However, I know they would strongly encourage PRs for these things if you want to have a go at it.

wangj50 · September 19, 2020, 1:45am

@Mehrbod_Estaki Thanks! I just skimmed through the current version of the standalone cutadapt user guide, and I don't think it would do what we want. I may try to put together a PR, although I am not too familiar with Python and git.
Thanks!

thermokarst · September 19, 2020, 2:52am

Fortunately this isn't the case - barcodes can be different lengths (even between samples) in the q2-demux plugin's methods. If you don't have 12 nt barcodes you will have to disable the Golay correction (--p-no-golay-error-correction). Once that is disabled, it is literally just a matter of matching the barcode sequence to the sample metadata barcode column. Give that a try and let us know if it addresses your needs.

wangj50 · September 21, 2020, 6:13pm

@thermokarst Thank you very much for pointing that out. I did not know that is not an issue any more.

When reading the code of _demux.py on GitHub, I had the impression that this would still be an issue, because on lines 367-368 (q2-demux/q2_demux/_demux.py at dev · qiime2/q2-demux · GitHub) a barcode map is already generated before the golay_error_correction argument is checked (which happens on line 387). When generating the barcode map, the _make_barcode_map function checks whether variable-length barcodes were provided and raises an error if so, and the golay_error_correction flag is not an argument of that function. I'll edit this post to remove this paragraph if I am wrong. Sorry if my post causes any confusion.

And disabling the Golay correction is still not sufficient to solve my problem because, as I mentioned, the issue is rooted in the barcode sequence file, which includes barcodes shorter than those in the mapping file (most of them are 1 bp short). If qiime demux only allows exact matches, those sequences would be discarded, which would leave many of my samples with no sequences at all.

thermokarst · September 21, 2020, 6:29pm

Hi @wangj50!

Actually I don't think that this has ever been the case for this plugin!

Look no further than a concrete example of barcodes less than 12 nts long right here:

https://github.com/qiime2/q2-demux/blob/a96839624fbd8ac7295388a10adea98b9b84451b/q2_demux/tests/test_demux.py#L547-L557

          self.barcodes = [('@s1/2 abc/2', 'AAAA', '+', 'YYYY'),
                           ('@s2/2 abc/2', 'TTAA', '+', 'PPPP'),
                           ('@s3/2 abc/2', 'AACC', '+', 'PPPP'),
                           ('@s4/2 abc/2', 'TTAA', '+', 'PPPP'),
                           ('@s5/2 abc/2', 'AACC', '+', 'PPPP'),
                           ('@s6/2 abc/2', 'AAAA', '+', 'PPPP'),
                           ('@s7/2 abc/2', 'CGGC', '+', 'PPPP'),
                           ('@s8/2 abc/2', 'GGAA', '+', 'PPPP'),
                           ('@s9/2 abc/2', 'CGGC', '+', 'PPPP'),
                           ('@s10/2 abc/2', 'CGGC', '+', 'PPPP'),
                           ('@s11/2 abc/2', 'GGAA', '+', 'PPPP')]

This is test data for the unit tests of this plugin; the barcodes here are all 4 nts long.

This isn't necessary!

Correct - I was only responding to your first point. Since there is no option for mismatches in this plugin, your second point still holds true.

One possible solution is to demux multiple times and merge the results. One for X nts long, then again for X-1 nts long, etc.
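The multi-pass idea can be sketched in plain Python to show the logic, independent of any QIIME 2 command: one exact-match pass per barcode length, with the metadata barcodes truncated to that length, and the assignments merged at the end. This is an illustration only. The function names are hypothetical, and the thread does not say which end of the short barcodes is missing, so trimming from the 3' end by default is an assumption.

```python
def truncate_barcodes(sample_barcodes, length, trim_from="end"):
    """Shorten every expected barcode to `length` bases.

    trim_from="end" keeps the first `length` bases; anything else keeps the last.
    Raises ValueError if two samples collapse to the same shortened barcode,
    because exact matching could then no longer tell them apart.
    """
    owner = {}
    truncated = {}
    for sample, barcode in sample_barcodes.items():
        if trim_from == "end":
            short = barcode[:length]
        else:
            short = barcode[len(barcode) - length:]
        if short in owner:
            raise ValueError(
                f"{sample} and {owner[short]} share barcode {short} at length {length}"
            )
        owner[short] = sample
        truncated[sample] = short
    return truncated


def demux_by_length(sample_barcodes, barcode_reads, lengths):
    """Run one exact-match pass per barcode length and merge the assignments.

    barcode_reads: iterable of (read_id, barcode_sequence) pairs
    lengths:       barcode lengths to try, e.g. [12, 11]
    Returns a dict mapping read_id -> sample ID for every read that matched.
    """
    reads = list(barcode_reads)
    assignments = {}
    for length in lengths:
        truncated = truncate_barcodes(sample_barcodes, length)
        lookup = {short: sample for sample, short in truncated.items()}
        for rid, seq in reads:
            # Only reads of exactly this length take part in this pass.
            if len(seq) == length and seq in lookup:
                assignments[rid] = lookup[seq]
    return assignments
```

In practice each pass would correspond to a separate demux run against a metadata column holding the shortened barcodes, and the collision check is worth running first for every length you plan to try.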

wangj50 · September 21, 2020, 7:21pm

@thermokarst thanks! Just to be clear, do they have to be the same length for all samples, in this case 4 nts long? When I was talking about the "uniform length of barcode across samples", I meant the same length for all barcodes (regardless of whether it is 4 nt, 8 nt, 12 nt, etc.).

Actually this probably won't be an issue in actual sequencing, since the index reads usually have a fixed length.

Thanks!

thermokarst · September 21, 2020, 7:28pm

No, they do not.