Greetings, QIIME 2 team!
I used QIIME 2 (version 2020.8) to analyze my 16S data. The data are paired-end reads, imported with the type 'MultiplexedPairedEndBarcodeInSequence'.
I demultiplexed the data into samples according to the metadata, using 'qiime cutadapt demux-paired' as follows:
qiime cutadapt demux-paired \
--i-seqs seqs_raw-QC-2_1.qza \
--m-forward-barcodes-file metadata-2_1.txt \
--m-forward-barcodes-column forward-barcode \
--p-error-rate 0 \
--o-per-sample-sequences seqs_demultiplexed-2_1.qza \
--o-untrimmed-sequences seqs_untrimmed-2_1.qza \
--verbose
qiime demux summarize \
--i-data seqs_demultiplexed-QC-2_2.qza \
--o-visualization seqs_demultiplexed-QC-2_2.qzv
For comparison, I also demultiplexed all samples with 'fastq-multx':
fastq-multx -B -m 0 -b forward.fastq reverse.fastq -o %.R1.fastq %.R2.fastq
#manually mv
fastq-multx -B -m 0 -b reverse.fastq forward.fastq -o %.R1.fastq %.R2.fastq
#manually combine two results of fastq-multx
Then I compared the two sets of demultiplexing results:
'qiime cutadapt demux-paired' retained 8611978 sequences, while 'fastq-multx' retained only 4164473 sequences.
Furthermore, one of my samples, 'TA_Blank_3_lib2', shows strikingly different results: 2165635 sequences after 'qiime cutadapt demux-paired', but only 4881 sequences after 'fastq-multx'.
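For reference, one way to obtain read counts like these from a per-sample FASTQ file is to divide its line count by four. The snippet below is only a minimal sketch: it assumes uncompressed FASTQ with exactly four lines per record, and the temporary file with two toy reads is hypothetical, standing in for a real per-sample file.

```shell
# Hypothetical stand-in for a real per-sample FASTQ file (two toy reads).
fastq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$fastq"

# Each FASTQ record spans four lines, so reads = lines / 4.
n_lines=$(wc -l < "$fastq")
n_reads=$((n_lines / 4))
echo "reads: $n_reads"

rm -f "$fastq"
```

For gzip-compressed files the same idea applies after decompressing the stream first.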
To find the reason for these large differences, I randomly selected one read that 'qiime cutadapt demux-paired' assigned to 'TA_Blank_3_lib2':
@A01415:62:H77LVDRXY:1:2101:25201:2394 1:N:0:CGACTGGA
CTACTGGGGTTTCTAATCCTGTTTGATACCCACGCTTTCGTGCTTCAGCGTCAGTTGTACCTTAGTAAGCTGCCTTCGCAATCGGAGTTCTGCGTGATATCTATGCATTCCACCGCTACACCACGCATTCCGCCTACCTCATCTACACTCAAGCCCGCCAGTATCAATGGCAATTTAGGAGTTAAGCTCCTAGATTTCACCGCTGACTTAACAGGCCGCCAACGCACCCTATAAACCCAATAAATCC
I searched for this read with grep in the unmatched output file of 'fastq-multx'.
I then searched for it in the raw data:
@A01415:62:H77LVDRXY:1:2101:25201:2394 1:N:0:CGACTGGA
GGACTACTGGGGTTTCTAATCCTGTTTGATACCCACGCTTTCGTGCTTCAGCGTCAGTTGTACCTTAGTAAGCTGCCTTCGCAATCGGAGTTCTGCGTGATATCTATGCATTCCACCGCTACACCACGCATTCCGCCTACCTCATCTACACTCAAGCCCGCCAGTATCAATGGCAATTTAGGAGTTAAGCTCCTAGATTTCACCGCTGACTTAACAGGCCGCCAACGCACCCTATAAACCCAATAAATCC
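As shown above, the raw read starts with 'GGA', and the demultiplexed read is identical except that these three bases have been removed. In other words, 'qiime cutadapt demux-paired' assigned this read to the sample on the basis of only a 3-nt match, 'GGA', which is the 3' end of the 12-nt barcode of 'TA_Blank_3_lib2' (GAACACTTTGGA).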
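To double-check this observation, here is a small sketch that finds the longest suffix of the barcode that matches the start of the raw read. The variable names are my own, and only the first 12 nt of the raw read shown above are used. My guess, which I have not confirmed for this plugin version, is that the barcode is searched as a non-anchored 5' adapter; as far as I know, cutadapt accepts partial matches of such adapters at the read start, with a default minimum overlap of 3 nt.

```shell
barcode=GAACACTTTGGA      # full 12-nt barcode of TA_Blank_3_lib2
read_start=GGACTACTGGGG   # first 12 nt of the raw read shown above

# Compare the last k bases of the barcode with the first k bases of the read,
# for every k, and remember the largest k at which they are identical.
longest=0
k=1
while [ "$k" -le "${#barcode}" ]; do
  suffix=$(printf '%s' "$barcode" | tail -c "$k")
  prefix=$(printf '%s' "$read_start" | cut -c "1-$k")
  if [ "$suffix" = "$prefix" ]; then
    longest=$k
  fi
  k=$((k + 1))
done
echo "longest barcode suffix at read start: $longest nt"
# prints: longest barcode suffix at read start: 3 nt
```

For this read the only match is the 3-nt suffix 'GGA', which is consistent with the three bases that were trimmed.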
I would therefore like to understand what causes this behavior.
For example, how does 'qiime cutadapt demux-paired' handle barcode sequences like these:
GAACACTTTGGA - sequence (Sample1)
NNNCACTTTGGA - sequence (Sample2)
NNNNNNTTTGGA - sequence (Sample3)
NNNNNNNNNGGA - sequence (Sample4)
NNNNNNNNNGGA - sequence (Sample5)
Are all of these assigned to the same sample? In our current results, the sequences of Sample4 and Sample5 are demultiplexed into a single sample, which suggests that the plugin assigns reads with incomplete barcodes to the same sample.
In addition, should I switch to a different QIIME 2 demultiplexing tool, such as 'qiime demux emp-paired'?