Hello QIIME2 team,
The issue of barcode mismatches comes back to me from time to time during demultiplexing. Currently, it seems QIIME2 can only handle barcodes that (1) have a uniform length across samples and (2) match exactly.
This is occasionally a problem because some sequencing centers provide a barcode sequence file containing barcode sequences shorter than the expected length. For example, we used the EMP protocol to prepare all our libraries (it designs the barcodes to be 12 bp long), but at the new sequencing center we have been using, about 10-15% of the barcode sequences are shorter than 12 bp, with the majority being 11 bp. Because of this, many samples in this run had no sequences after demultiplexing (why the barcodes for those empty samples are always shorter than 12 bp could be a separate question to discuss). When I use the FASTX-Toolkit barcode splitter and allow for some mismatches (--mismatch 3 --partial 2), I find that the previously empty samples do have a significant number of sequences.
Since the FASTX-Toolkit only generates sample-specific barcode sequence files, I have to go back to the original sequencing files, use the headers of the barcode sequences to extract the per-sample sequences, and then import them into QIIME2 again. This is very slow because, for each sample, I am searching the entire raw sequencing file for headers that match the headers of that sample's barcode sequences.
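The repeated search described above could probably be replaced by one pass over the raw file, using a dictionary from read header to sample. Below is a minimal sketch of that idea in plain Python. It assumes uncompressed four-line FASTQ records and that the first whitespace-delimited token of the header identifies a read in both the barcode file and the raw file. The function names (`read_fastq`, `index_headers`, `split_reads`) are made up for illustration and are not part of QIIME2 or the FASTX-Toolkit.

```python
from collections import defaultdict


def read_fastq(handle):
    """Yield (header, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the iteration
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def read_id(header):
    """Assumption: the read ID is the first token after the leading '@'."""
    return header[1:].split()[0]


def index_headers(per_sample_barcode_files):
    """Map read ID -> sample from the per-sample barcode FASTQ files."""
    header_to_sample = {}
    for sample, handle in per_sample_barcode_files.items():
        for header, _, _ in read_fastq(handle):
            header_to_sample[read_id(header)] = sample
    return header_to_sample


def split_reads(raw_reads, header_to_sample):
    """Walk the raw FASTQ once and group records by sample."""
    per_sample = defaultdict(list)
    for header, sequence, quality in read_fastq(raw_reads):
        sample = header_to_sample.get(read_id(header))
        if sample is not None:  # reads whose barcode was never assigned are skipped
            per_sample[sample].append((header, sequence, quality))
    return dict(per_sample)
```

Each lookup is a constant-time dictionary access, so the raw file is read once in total rather than once per sample. For a real run the records would be written straight to per-sample output files instead of being collected in memory.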
As a result, I was thinking about how QIIME2 could be improved to handle this situation. After reading through the source code of the demux plugin, I think the easiest approach is to let users provide a barcode map in which each barcode maps to a single sample, but multiple barcodes can map to the same sample (so that mismatches are allowed). This way, the overall structure of the demux plugin can be maintained, and users can generate the barcode map with other tools and then use it in QIIME2.
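To make the barcode map idea concrete, here is a rough sketch of how an external tool might generate such a map. It is only an illustration under stated assumptions: it tolerates one substitution or one missing base at either end of the barcode (a stricter rule than the FASTX-Toolkit settings mentioned earlier), and the names `variants` and `build_barcode_map` are hypothetical. Nothing here is an existing QIIME2 interface.

```python
def variants(barcode, alphabet="ACGTN"):
    """Return the barcode plus every variant with one substitution
    or one base missing from either end (an assumed tolerance rule)."""
    out = {barcode}
    for i, original in enumerate(barcode):
        for base in alphabet:
            if base != original:
                out.add(barcode[:i] + base + barcode[i + 1:])
    out.add(barcode[1:])   # first base missing
    out.add(barcode[:-1])  # last base missing
    return out


def build_barcode_map(sample_barcodes):
    """Map each accepted barcode variant to exactly one sample.

    Exact barcodes always win; a variant reachable from two different
    samples is ambiguous and is left out of the map entirely.
    """
    exact = {barcode: sample for sample, barcode in sample_barcodes.items()}
    if len(exact) != len(sample_barcodes):
        raise ValueError("two samples share the same barcode")

    barcode_map = dict(exact)
    ambiguous = set()
    for sample, barcode in sample_barcodes.items():
        for variant in variants(barcode):
            if variant in exact or variant in ambiguous:
                continue
            owner = barcode_map.get(variant)
            if owner is None:
                barcode_map[variant] = sample
            elif owner != sample:
                del barcode_map[variant]
                ambiguous.add(variant)
    return barcode_map
```

With a map like this, demultiplexing stays an exact dictionary lookup of the observed barcode, which is why the existing structure of the plugin would not need to change. All of the tolerance logic lives in whatever tool builds the map.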