I’m working through one of the tutorials (Atacama soil microbiome) and found something strange during the demultiplexing step.
In the metadata file, I randomly chose a sample, BAQ3473.3, whose barcode is (AACCTCGGATAA), and took its reverse complement (TTATCCGAGGTT).
Then I searched for the reverse-complemented barcode (TTATCCGAGGTT) in the barcodes.fastq file and got 10965 hits.
However, in the demux-full.qza (demux-full.qzv), the number of sequences for sample BAQ3473.3 is 12991.
Why don't the two numbers match?
I tried another sample, BAQ1370.1.2. Searching barcodes.fastq for its reverse-complemented barcode gives no hits at all, yet the number of sequences for it in demux-full.qzv is 16.
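For anyone who wants to repeat this manual check, here is a minimal sketch in Python. It assumes a plain four-line-per-record FASTQ file and counts only reads whose whole sequence equals the reverse-complemented barcode; the function names are made up for illustration and are not part of QIIME 2.

```python
# Translation table for complementing nucleotides (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def iter_fastq_sequences(lines):
    """Yield the sequence line of each record from an iterable of FASTQ lines.

    Assumes the common four-line layout: header, sequence, '+', qualities.
    """
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def count_exact_barcode_reads(lines, barcode):
    """Count reads whose sequence is exactly equal to `barcode`."""
    return sum(1 for seq in iter_fastq_sequences(lines) if seq == barcode)


# Example usage (the path is whatever your barcodes file is called;
# use gzip.open(path, "rt") instead if the file is gzipped):
#
# target = reverse_complement("AACCTCGGATAA")
# with open("barcodes.fastq") as handle:
#     print(count_exact_barcode_reads(handle, target))
```

Because this counts exact matches only, any read whose barcode differs from the expected one by even a single base is left out of the total.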
Hi Jordan, can you explain a little more? It doesn't make sense to me that the number of sequences for a given sample after demultiplexing is larger than the number of hits for its barcode in the raw file.
If there is some kind of error correction, the number of per-sample sequences should be smaller than the number of hits in the raw barcodes file.
We made the same observation using qiime cutadapt demux-single with IonTorrent data. We checked several runs against QIIME 1, another demultiplexing tool, and the raw file, as you did, and we always found more hits with the QIIME 2 tool. Look here: Problem with cutadapt demultiplexing of IonTorrent data (sorry, I made the text too long for a forum discussion).
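The logic is actually the opposite. The EMP protocol uses carefully designed Golay barcodes, which allow for a certain amount of error correction and the salvaging of reads that would otherwise go unassigned. A read whose barcode contains a sequencing error will not show up in an exact-match search, but it can still be corrected and assigned to the sample, so the per-sample count after demultiplexing can be larger than the number of exact hits. I suspect that if you were to rerun the code with error correction disabled (--p-no-golay-error-correction), your results would match your manual search.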
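To get a rough feel for how many such near-miss barcodes are in the raw file, you could count reads by their number of mismatches to the expected barcode. The sketch below is only an approximation under my own assumptions: it uses a plain nucleotide Hamming distance, which is not the Golay decoding that QIIME 2 performs, so its totals will not necessarily equal the demultiplexed counts. The function names and the `max_mismatches` parameter are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def count_by_mismatch(sequences, barcode, max_mismatches=1):
    """Tally reads by mismatch count against `barcode`.

    Returns a Counter mapping mismatch count -> number of reads, keeping
    only reads with at most `max_mismatches` differences. This is a simple
    stand-in for illustration, NOT the Golay decoder itself.
    """
    counts = Counter()
    for seq in sequences:
        if len(seq) != len(barcode):
            continue  # skip reads of unexpected length
        distance = hamming(seq, barcode)
        if distance <= max_mismatches:
            counts[distance] += 1
    return counts


# Example usage, with `sequences` being the barcode reads from the raw file:
#
# counts = count_by_mismatch(sequences, "TTATCCGAGGTT", max_mismatches=1)
# print(counts[0], "exact;", counts[1], "with one mismatch")
```

If the one-mismatch tally is non-zero, that is at least consistent with error correction rescuing reads that an exact search misses, including for a sample with no exact hits at all.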
You can also do a little digging inside the artifact where you will find something to this effect which confirms this: