Serious bug: reads with incomplete barcodes mistakenly demultiplexed into one sample by 'qiime cutadapt demux-paired'

Greetings, QIIME 2 team!
I used QIIME 2, version 2020.8, to analyze my 16S data. My data are paired-end reads, imported with the type 'MultiplexedPairedEndBarcodeInSequence'.

I demultiplexed my data to split the samples according to the metadata, using 'qiime cutadapt demux-paired' as follows:

qiime cutadapt demux-paired \
  --i-seqs seqs_raw-QC-2_1.qza \
  --m-forward-barcodes-file metadata-2_1.txt \
  --m-forward-barcodes-column forward-barcode \
  --p-error-rate 0 \
  --o-per-sample-sequences seqs_demultiplexed-2_1.qza \
  --o-untrimmed-sequences seqs_untrimmed-2_1.qza \
  --verbose

qiime demux summarize \
  --i-data seqs_demultiplexed-QC-2_2.qza \
  --o-visualization seqs_demultiplexed-QC-2_2.qzv

For comparison, I also demultiplexed all samples with 'fastq-multx':

fastq-multx -B -m 0 -b forward.fastq reverse.fastq -o %.R1.fastq %.R2.fastq
#manually mv
fastq-multx -B -m 0 -b reverse.fastq forward.fastq -o %.R1.fastq %.R2.fastq
#manually combine two results of fastq-multx

Then I compared the two demultiplexing results:
'qiime cutadapt demux-paired' retains 8611978 sequences, while 'fastq-multx' retains only 4164473 sequences.

Furthermore, one of my samples, 'TA_Blank_3_lib2', shows strikingly different demultiplexing results: 2165635 sequences after 'qiime cutadapt demux-paired' versus 4881 sequences after 'fastq-multx'.


To find the reason for these large differences, I randomly selected one sequence that 'qiime cutadapt demux-paired' assigned to 'TA_Blank_3_lib2':
@A01415:62:H77LVDRXY:1:2101:25201:2394 1:N:0:CGACTGGA
CTACTGGGGTTTCTAATCCTGTTTGATACCCACGCTTTCGTGCTTCAGCGTCAGTTGTACCTTAGTAAGCTGCCTTCGCAATCGGAGTTCTGCGTGATATCTATGCATTCCACCGCTACACCACGCATTCCGCCTACCTCATCTACACTCAAGCCCGCCAGTATCAATGGCAATTTAGGAGTTAAGCTCCTAGATTTCACCGCTGACTTAACAGGCCGCCAACGCACCCTATAAACCCAATAAATCC

I then used grep to look for this sequence in the unmatched output file of 'fastq-multx'.

I also used grep to find this sequence in the raw data:
@A01415:62:H77LVDRXY:1:2101:25201:2394 1:N:0:CGACTGGA
GGACTACTGGGGTTTCTAATCCTGTTTGATACCCACGCTTTCGTGCTTCAGCGTCAGTTGTACCTTAGTAAGCTGCCTTCGCAATCGGAGTTCTGCGTGATATCTATGCATTCCACCGCTACACCACGCATTCCGCCTACCTCATCTACACTCAAGCCCGCCAGTATCAATGGCAATTTAGGAGTTAAGCTCCTAGATTTCACCGCTGACTTAACAGGCCGCCAACGCACCCTATAAACCCAATAAATCC

As shown above, 'qiime cutadapt demux-paired' assigned this sequence to the sample based on a match of only 3 nt, 'GGA', which is the end of the 12 nt barcode of 'TA_Blank_3_lib2' (GAACACTTTGGA).
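
This observation is consistent with how the cutadapt guide describes regular (non-anchored) 5' adapters: a suffix of the adapter may match the start of the read, with a default minimum overlap of 3 nt. Below is a minimal Python sketch of that idea, assuming the barcodes are searched as non-anchored 5' adapters. The function name is hypothetical, and this is a simplified exact-match model, not cutadapt's actual alignment code.

```python
def partial_5prime_overlap(barcode: str, read: str, min_overlap: int = 3) -> int:
    """Length of the longest barcode suffix that exactly equals the read prefix.

    Simplified model (assumption): exact matching only, no errors or indels,
    and occurrences of the full barcode further inside the read are ignored.
    """
    longest = min(len(barcode), len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.startswith(barcode[-k:]):
            return k
    return 0


barcode = "GAACACTTTGGA"               # barcode of TA_Blank_3_lib2
raw_read = "GGACTACTGGGGTTTCTAATCCTG"  # first bases of the raw read shown above

overlap = partial_5prime_overlap(barcode, raw_read)
trimmed = raw_read[overlap:]
anchored_match = raw_read.startswith(barcode)

print(overlap)         # number of barcode bases matched at the read start
print(trimmed)         # read after removing the matched bases
print(anchored_match)  # whether the full barcode sits at the very start
```

In this model the read is accepted on a 3 nt overlap and loses its first three bases, which matches the difference between the raw and demultiplexed reads above. A check that requires the full barcode at the start of the read would reject it.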


Therefore, I would like to know what causes this problem.

For example, how does 'qiime cutadapt demux-paired' treat reads with barcodes like these:
GAACACTTTGGA - sequence (Sample1)
NNNCACTTTGGA - sequence (Sample2)
NNNNNNTTTGGA - sequence (Sample3)
NNNNNNNNNGGA - sequence (Sample4)
NNNNNNNNNGGA - sequence (Sample5)
Are these sequences all assigned to one sample? In our current results, sequences like those of Sample4 and Sample5 are demultiplexed into a single sample, which means that the plugin assigns sequences with incomplete barcodes to the same sample.

In addition, should I switch to another demultiplexing tool in QIIME 2, such as 'qiime demux emp-paired'?

Hi @Yanren_Wang!

I can't help with fastq-multx, but my guess is that this is an issue with how you're telling cutadapt to search for your barcodes. You probably need to "anchor" them - see here for more details:

https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types
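
For standalone cutadapt, the guide describes anchoring a 5' adapter by prefixing its sequence with `^`, and this also applies to sequences listed in a barcode FASTA file. The following is a small Python sketch that builds such a file. The helper name, the sample-to-barcode mapping, and the output file name are hypothetical, and the sketch does not show how to pass anchored barcodes through q2-cutadapt.

```python
from pathlib import Path


def write_anchored_barcodes(barcodes: dict[str, str], path: str) -> str:
    """Write a FASTA file whose sequences start with '^' (5'-anchored in cutadapt)."""
    lines = []
    for sample_id, seq in barcodes.items():
        lines.append(f">{sample_id}")
        lines.append(f"^{seq.upper()}")
    text = "\n".join(lines) + "\n"
    Path(path).write_text(text)
    return text


# Illustrative mapping; in practice this would come from the metadata file.
fasta = write_anchored_barcodes(
    {"TA_Blank_3_lib2": "GAACACTTTGGA"},
    "barcodes_anchored.fasta",  # hypothetical file name
)
print(fasta)
```

A file like this could then be given to standalone cutadapt with `-g file:barcodes_anchored.fasta`.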

As well, have you taken a closer look at the output logs from q2-cutadapt? I see you included the --verbose flag, so they should be there for you to review. That should give you a good idea of what cutadapt is finding.

I think Mr. Wang encountered a problem like this:
5'-anchored 12nt barcode ACGAGACTGATT - mysequence (Sample4)
5'-anchored 12nt barcode GCTGTACGGATT - mysequence (Sample5)
If all barcodes in my sequences are complete, 'qiime cutadapt demux-paired' works fine. However, poor sequencing or library preparation produces reads with damaged barcodes, and the Sample4 and Sample5 barcodes share the same last four bases. If reads carry truncated barcodes, such as a 5'-anchored 4 nt barcode GATT - mysequence, then in theory reads from Sample4 and Sample5 cannot be told apart. However, I found that the plugin assigns them all to one sample. If this assignment is right, why are the sequences belonging to the other sample dropped?
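
To make the ambiguity concrete, here is a rough Python sketch that compares the two barcodes against a read starting with only GATT. It assumes exact matching and a minimum overlap of 3 nt. The helper name and the read sequence are made up for illustration, and how cutadapt itself breaks such a tie is not modeled.

```python
def suffix_prefix_overlap(barcode: str, read: str, min_overlap: int = 3) -> int:
    """Longest barcode suffix that exactly equals the start of the read (0 if none)."""
    for k in range(min(len(barcode), len(read)), min_overlap - 1, -1):
        if read.startswith(barcode[-k:]):
            return k
    return 0


barcodes = {
    "Sample4": "ACGAGACTGATT",
    "Sample5": "GCTGTACGGATT",
}
# Hypothetical read whose barcode was truncated to its last four bases.
read = "GATT" + "CCTGTTTGATACCCACGC"

# Non-anchored view: both barcodes overlap the read start equally well.
overlaps = {s: suffix_prefix_overlap(b, read) for s, b in barcodes.items()}
# Anchored view: the full barcode must sit at the very start of the read.
anchored = [s for s, b in barcodes.items() if read.startswith(b)]

print(overlaps)
print(anchored)
```

Both samples tie at an overlap of 4 nt, so the read cannot be assigned reliably, while the anchored check leaves it unassigned.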

The 'cutadapt demux-paired' plugin has only one parameter that controls barcode matching, --p-error-rate. How can I specify the barcode type?

Please see the link I shared above regarding cutadapt's adapter types. This is how you control things like anchoring and adapter pairing. Keep us posted!