Dear qiime2 team,
First of all, I would like to congratulate you on your excellent work and the tidy organization of your website.
I would like to suggest an improvement to Golay error correction in demux emp-paired. I have recently seen cases where a barcode sequencing error combined with --p-golay-error-correction caused over 100 reads from one sample to be assigned to other samples. The barcode sequence errors included insertions/deletions. In other words, golay-error-correction corrects any error of up to 3 bits, so it can also improperly "correct" a barcode that was shifted by an insertion/deletion.
Therefore, I thought that if there was an error-correction option that corrected only the most common single nucleotide substitutions, it would correct most of the errors that occur without improperly correcting errors with insertions/deletions. If you could add such a feature to emp-paired, it would be greatly appreciated.
Thank you all very much.
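For illustration, a substitution-only corrector along the lines proposed above could look like the following minimal Python sketch. This is not how q2-demux works, and the function names build_substitution_lookup and correct_barcode are hypothetical. It accepts an observed barcode only if it is an exact match or exactly one substitution away from a single used barcode, and it rejects everything else, including barcodes shifted by an insertion/deletion.

```python
NUCLEOTIDES = "ACGT"


def build_substitution_lookup(barcodes):
    """Map every used barcode, and every single-substitution variant of it,
    back to the barcode it came from.

    Variants reachable from more than one used barcode are stored as None,
    meaning "ambiguous, do not correct".
    """
    used = set(barcodes)
    lookup = {bc: bc for bc in used}
    for bc in used:
        for i, original in enumerate(bc):
            for nt in NUCLEOTIDES:
                if nt == original:
                    continue
                variant = bc[:i] + nt + bc[i + 1:]
                if variant in used:
                    # An exact used barcode always wins over a correction.
                    continue
                if variant in lookup and lookup[variant] != bc:
                    lookup[variant] = None  # ambiguous between two barcodes
                else:
                    lookup[variant] = bc
    return lookup


def correct_barcode(observed, lookup):
    """Return the used barcode for `observed`, or None if it cannot be
    corrected by a single substitution."""
    return lookup.get(observed)
```

With a lookup like this, a read whose barcode differs from every used barcode by more than one substitution is simply left unassigned instead of being moved to the nearest codeword.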
Hello @dfground and welcome to the forums!
Thank you for bringing this to our attention.
It would be interesting to see some of the data on this - it really shouldn't "correct" a real barcode to another real barcode.
If a barcode has so many errors that it becomes identical to another real barcode, that could explain this behavior. But also that would be hard to identify unless you already knew what would be in each sample. How did you discover this issue?
Could you share with us a minimal working example from your data set?
Thank you!
Thanks for the reply. Sorry for the delay in replying.
The steps I took to find this issue are as follows.
I ran qiime demux emp-paired with a --m-barcodes-file listing all 4096 Golay barcodes, including barcodes not actually used for the real samples.
I then analyzed the details.tsv file included in the qza output from --o-error-correction-details. I found 899 and 319 reads, respectively, with single-nucleotide insertion errors in the real barcode, as shown in the two figures below. Both of them were decoded to the index of GTATTGACGGTC by golay-error-correction. Indeed, based on the similarity of ASV compositions, the reads demuxed to GTATTGACGGGTC were inferred to be derived from the GTATTACGATCC reads.
This is the most extreme case, but many similar errors were found.
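The tallying step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes details.tsv is a plain tab-separated file with a single header row, and the column name "barcode-corrected" is my assumption (only "barcode-uncorrected" is named in this thread). The function name tally_unused_corrections is hypothetical. Any extra header or directive rows in the real file would need to be skipped first.

```python
import csv
from collections import Counter


def tally_unused_corrections(tsv_lines, used_barcodes,
                             uncorrected_col="barcode-uncorrected",
                             corrected_col="barcode-corrected"):
    """Count (uncorrected, corrected) barcode pairs whose corrected barcode
    is not one of the barcodes actually used for real samples.

    `tsv_lines` is any iterable of text lines, e.g. an open file handle.
    The column names are assumptions and can be overridden.
    """
    used = set(used_barcodes)
    counts = Counter()
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    for row in reader:
        corrected = row[corrected_col]
        if corrected and corrected not in used:
            counts[(row[uncorrected_col], corrected)] += 1
    return counts


# Hypothetical usage, assuming details.tsv was extracted from the artifact:
# with open("details.tsv", newline="") as handle:
#     counts = tally_unused_corrections(handle, used_barcodes)
# for pair, n in counts.most_common(10):
#     print(pair, n)
```

Sorting the result by count is what surfaces cases like the ones above, where many reads with the same uncorrected barcode land on a single unused index.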
Thank you for helping us investigate this.
Using every barcode is good practice. Not everyone does this...
Very interesting!
Can you share with us the details.tsv file? You can post the file here, or send me a direct message so I can share it with the Qiime2 devs.
Thank you for the reply.
Since I may not have permission to provide my details.tsv here, I have instead performed a similar analysis using data from the “Atacama soil microbiome” tutorial. A link to the details.tsv file is provided below.
In this file, I found reads assigned to unused indexes that appear to result from single-nucleotide deletions.
For example, 45, 42, and 45 reads were found for CCAGGTATATTC to CCAGTATATTCA, TGTGCAAGCGAC to GTGCAAGCGACA, and TTCTCACCTTTC to TCTCACCTTTCA, respectively.
The feature composition of these unused-index reads was consistent with the samples they appear to derive from, as shown in the figure below.
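The deletion pattern in the three pairs above can be checked mechanically. The sketch below is a minimal illustration, not part of QIIME 2, and the function name consistent_with_single_deletion is hypothetical. It assumes that when one base is deleted from a 12-nt barcode, the 12-nt barcode window is filled by one base of whatever sequence follows, so only the first 11 observed bases are compared.

```python
def consistent_with_single_deletion(observed, barcode):
    """Return True if `observed` could be `barcode` with exactly one base
    deleted, followed by one base of downstream sequence.

    Assumes both sequences have the same length (the barcode window).
    """
    if len(observed) != len(barcode) or observed == barcode:
        return False
    shifted = observed[:-1]  # last base is assumed to come from downstream
    return any(
        barcode[:i] + barcode[i + 1:] == shifted
        for i in range(len(barcode))
    )


pairs = [
    ("CCAGGTATATTC", "CCAGTATATTCA"),
    ("TGTGCAAGCGAC", "GTGCAAGCGACA"),
    ("TTCTCACCTTTC", "TCTCACCTTTCA"),
]
for barcode, observed in pairs:
    print(barcode, observed, consistent_with_single_deletion(observed, barcode))
```

All three pairs from this post satisfy the check, which is what you would expect if each second sequence is the first one with a single base missing.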
Thank you for sharing this!
I like your approach of replicating this error with public data.
I'm trying to understand what you have found.
I'm comparing this to your private data:
CLUSTAL format alignment by MAFFT (v7.511)
After-gola gtattgacgg-tc
demuxed-to gtattgacgggtc
truly-from gtatt-acgatcc
           ***** ***. .*
Are these two errors the same, or just a lesser version that does not cause sample misidentification?
I'm also struggling to understand what you found.
Both of them were decoded to the index of GTATTGACGGTC by golay-error-correction. Indeed, based on the similarity of ASV compositions, the reads demuxed to GTATTGACGGGTC...
GTATTGACGGTC
GTATTGACGGGTC
These are not the same....
Are 'decoded' and 'demuxed' different? What do you mean?
(Golay should only be correcting to real barcodes, so after-golay, decoded, and demuxed should all be identical.)
Thanks for the reply.
I apologize for the confusion caused by a notation error in my response on 4/18.
GTATTGACGGGTC was incorrect; the correct sequence is GTATTGACGGTC.
I also think I did not provide enough information, so I have added it below.
Here is what I found:
barcode-uncorrected 1 (GTA TTT ACG ATC) and barcode-uncorrected 2 (GTA TTA ACG ATC) were both corrected to unused index 1 (GTA TTG ACG GTC) by golay-error-correction.
barcode-uncorrected 1  GTATTTACGATC
unused index 1         GTATTGACGGTC
                       ***** ***.**
barcode-uncorrected 2  GTATTAACGATC
unused index 1         GTATTGACGGTC
                       *****.***.**
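To put numbers on the two alignments above, here is a small Python sketch that counts mismatches at the nucleotide level and at the bit level. The 2-bit encoding in NT_TO_BITS is an assumption on my part about how nucleotides are mapped to bits for the Golay code, and it should be checked against the q2-demux source. The function names are hypothetical.

```python
# ASSUMPTION: 2-bit encoding of nucleotides; verify against q2-demux.
NT_TO_BITS = {"A": "11", "C": "00", "T": "10", "G": "01"}


def nucleotide_distance(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def bit_distance(a, b, nt_to_bits=NT_TO_BITS):
    """Number of differing bits after encoding each nucleotide as 2 bits."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    bits_a = "".join(nt_to_bits[nt] for nt in a)
    bits_b = "".join(nt_to_bits[nt] for nt in b)
    return sum(x != y for x, y in zip(bits_a, bits_b))


unused_index_1 = "GTATTGACGGTC"
for uncorrected in ("GTATTTACGATC", "GTATTAACGATC"):
    print(uncorrected,
          nucleotide_distance(uncorrected, unused_index_1),
          bit_distance(uncorrected, unused_index_1))
```

Both uncorrected barcodes differ from unused index 1 at two nucleotide positions. Under the assumed encoding that corresponds to 3 and 2 bits respectively, which is within the 3-bit range that Golay correction will accept.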
Both barcode-uncorrected 1 and barcode-uncorrected 2 appear to have been caused by the insertion of a single base into used index 1 (GTA TTA CGA TCC).
barcode-uncorrected 1  GTATTTACGATC-
used index 1           GTATT-ACGATCC
                       ***** ******
barcode-uncorrected 2  GTATTAACGATC-
used index 1           GTATTA-CGATCC
                       ****** *****
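The insertion hypothesis in the alignments above can also be tested directly. The following is a minimal Python sketch, not part of QIIME 2, and the function name consistent_with_single_insertion is hypothetical. It assumes that an inserted base pushes the last base of the true barcode out of the 12-nt barcode window, so the observed barcode is the true barcode plus one inserted base, truncated to the original length.

```python
NUCLEOTIDES = "ACGT"


def consistent_with_single_insertion(observed, barcode):
    """Return True if `observed` could be `barcode` with exactly one base
    inserted and the result truncated to the barcode length.

    Assumes both sequences have the same length (the barcode window).
    """
    if len(observed) != len(barcode) or observed == barcode:
        return False
    length = len(barcode)
    for i in range(length):
        for nt in NUCLEOTIDES:
            inserted = barcode[:i] + nt + barcode[i:]
            if inserted[:length] == observed:
                return True
    return False


used_index_1 = "GTATTACGATCC"
for uncorrected in ("GTATTTACGATC", "GTATTAACGATC"):
    print(uncorrected, consistent_with_single_insertion(uncorrected, used_index_1))
```

Both uncorrected barcodes pass this check against used index 1, while unused index 1 (GTATTGACGGTC) does not. That fits the idea that the reads came from used index 1 with an insertion and were then moved to the unused index by the correction step.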
Hello @dfground
I've opened a new issue on GitHub describing this problem.
If you use GitHub, you can share with us anything else you find!
Thank you for bringing this to our attention.
Hello @dfground,
I was unable to replicate the findings from your May 6 post. When running the demux command from the Atacama soil tutorial, I also get a different details.tsv file than the one you posted. To be able to replicate this I would need your details artifact itself (not just the tsv file within it), and a list of anything that you did differently than what the tutorial specifies.