I am posting this to find out whether our problem can be solved with QIIME2 or whether it is a lab or provider problem.
Our Illumina sequence data suffer from severe barcode mismatches between R1 and R2. We usually run a 2 x 250 HiSeq protocol targeting a 300 bp region of the 16S rRNA gene.
In our current DADA2 pipeline, demultiplexing checks the barcodes in both R1 and R2 by looping cutadapt over the barcodes with these parameters:
cutadapt -j 30 -O 8 --no-indels -e 0 -g ^Barcode1 -G ^Barcode1 --discard-untrimmed
However, in most cases the barcodes differ between corresponding reads in R1 and R2. This step loses somewhere between 50% and 95% of all reads, meaning that in a 10 million read project we lose at least 5 million reads just by demultiplexing.
We checked if our barcode library has a contamination problem by testing if there are systematic errors (e.g. overrepresentation of certain barcodes, or if barcodes from neighboring box locations affect each other more frequently) but failed to see any.
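For anyone who wants to reproduce this kind of check, here is a minimal Python sketch of how the barcode pairs could be tabulated. It is not our actual script. It assumes the barcode sits at the very start of both reads (as the anchored `-g ^... -G ^...` in the cutadapt call implies), that all barcodes have the same length, and that only exact matches count (as with `-e 0`). The function names and the `"unknown"` label are made up for illustration.

```python
from collections import Counter
from itertools import islice


def read_seqs(lines):
    """Yield the sequence line (2nd of every 4) from FASTQ-formatted lines."""
    return islice(lines, 1, None, 4)


def tabulate_barcode_pairs(r1_lines, r2_lines, barcodes):
    """Count (R1 barcode, R2 barcode) combinations over paired reads.

    barcodes maps a sample name to its barcode sequence.
    Assumption: all barcodes share one length and sit at the 5' end of
    both reads; anything without an exact match is labelled "unknown".
    """
    length = len(next(iter(barcodes.values())))
    by_seq = {seq: name for name, seq in barcodes.items()}
    counts = Counter()
    for s1, s2 in zip(read_seqs(r1_lines), read_seqs(r2_lines)):
        b1 = by_seq.get(s1.strip()[:length], "unknown")
        b2 = by_seq.get(s2.strip()[:length], "unknown")
        counts[(b1, b2)] += 1
    return counts


def matched_fraction(counts):
    """Fraction of read pairs whose R1 and R2 carry the same known barcode."""
    total = sum(counts.values())
    matched = sum(
        n for (b1, b2), n in counts.items() if b1 == b2 and b1 != "unknown"
    )
    return matched / total if total else 0.0
```

The line iterables can be open file handles, for example from `gzip.open(path, "rt")` for gzipped FASTQ files. Off-diagonal entries such as `("BC01", "BC02")` that are concentrated in a few combinations would point to a systematic problem, while an even spread across all combinations would look more like random barcode swapping.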
The problem is so severe that the scientific interpretation changes drastically depending on whether we analyze the merged paired-end reads or only the single reads. Interestingly, there is good agreement between the R1-only and R2-only analyses, but both differ greatly from the merged reads. This is likely the result of the read loss and the subsequent loss of many rare reads, and thus ASVs.
Our provider insists that there is no error on their side.
Do you also see frequent mismatches between the barcodes of corresponding reads in R1 and R2?
Do you have any advice on resolving this problem? Can this be further evaluated or even fixed with QIIME2?