Strange demultiplexing behavior

Dear all!
I am currently struggling with several datasets. They are NovaSeq runs, already separated by index but still multiplexed.
The barcodes are in-line, at the beginning of the forward reads, so this looked like an easy task for the cutadapt demux-paired plugin. After demultiplexing I lost several samples: they contained few reads or were otherwise ‘poor’. At first I thought this was due to the sequencing itself, but one of my colleagues ran the same dataset in mothur and got all samples as good ones, with a lot of reads. I tried several options, including mixed orientation and a higher error rate in cutadapt while demultiplexing, but I am still losing the ‘poor’ samples. Here is the cutadapt command:

# Also tried --p-error-rate 0.1 and 0.3, and ran with and without --p-mixed-orientation
qiime cutadapt demux-paired \
    --i-seqs multplx.qza \
    --m-forward-barcodes-file metadata.tsv \
    --m-forward-barcodes-column BarcodeSequence \
    --o-per-sample-sequences demux.qza \
    --o-untrimmed-sequences untrimmed.qza \
    --p-error-rate 0.2 \
    --p-mixed-orientation

I also tried to run the same thing outside of the QIIME environment:

cutadapt -e 1 -a file:barcodes.fa -o {name}1.fastq.gz -p {name}2.fastq.gz forward.fastq.gz reverse.fastq.gz

But the same ‘poor’ samples still came out with few reads.

The funny thing is that when I swapped the forward and reverse reads in the above command (passing the reverse reads as forward and the forward reads as reverse), it worked, but I got different ‘poor’ samples. I cannot understand why it worked in the first place, since according to the protocol there are no barcodes in the reverse reads. At the same time, running cutadapt with the mixed orientation option did not resolve the issue.
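A minimal sketch of one way to check directly where the barcodes sit, independent of any demultiplexer. It was not part of the runs above. The function names, the barcode dictionary, and the file names are hypothetical; it assumes plain four-line FASTQ records and counts substitutions only (no indels), which is simpler than what the tools above do. Reads that match two barcodes equally well are reported as ‘ambiguous’, since tools may resolve such ties differently.

```python
import gzip
from collections import Counter


def read_sequences(path):
    """Yield the sequence line of every record in a FASTQ file (.gz or plain).

    Assumes strict four-line records (no wrapped sequences).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:
                yield line.strip()


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def count_barcode_prefixes(path, barcodes, max_mismatches=0):
    """Count reads whose first bases match a barcode.

    barcodes: dict mapping sample name -> barcode sequence (hypothetical input).
    Returns a Counter keyed by sample name, plus 'ambiguous' (tie between
    barcodes at the best distance) and 'unmatched'.
    """
    counts = Counter()
    for seq in read_sequences(path):
        best_samples = []
        best_distance = max_mismatches + 1  # anything above the limit is no match
        for sample, barcode in barcodes.items():
            prefix = seq[: len(barcode)]
            if len(prefix) < len(barcode):
                continue  # read is shorter than the barcode
            distance = hamming(prefix, barcode)
            if distance < best_distance:
                best_samples, best_distance = [sample], distance
            elif distance == best_distance and distance <= max_mismatches:
                best_samples.append(sample)
        if not best_samples:
            counts["unmatched"] += 1
        elif len(best_samples) > 1:
            counts["ambiguous"] += 1
        else:
            counts[best_samples[0]] += 1
    return counts


# Example usage (placeholder file names and barcodes):
# barcodes = {"sampleA": "ACGTAC", "sampleB": "TGCATG"}
# print(count_barcode_prefixes("forward.fastq.gz", barcodes, max_mismatches=1))
# print(count_barcode_prefixes("reverse.fastq.gz", barcodes, max_mismatches=1))
```

Running it on both the forward and the reverse file would show whether a noticeable share of reverse reads also starts with a barcode, and how many reads become ambiguous once a mismatch is allowed.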

So I tried sabre to demultiplex the data with the following command:

sabre pe -m 1 -f forward.fastq.gz -r reverse.fastq.gz -b barcodes.tsv -u no_bc_match_R1.fq -w no_bc_match_R2.fq

This time I got all samples with a good number of reads. It looks like sabre handles my datasets well.

But I also tried another tool - GBSX:

gbsx --Demultiplexer -f1 forward.fastq.gz -f2 reverse.fastq.gz -i barcodes.tsv -o outdir -mb 1 -t 6 -gzip true -rad true

It worked, but I got some ‘poor’ samples. The strange thing is that these ‘poor’ samples are not the same ones I got with cutadapt. So some samples that are ‘rich’ with cutadapt are ‘poor’ with GBSX, and vice versa.

I cannot understand why different tools handle these datasets so differently. Is it a problem with NovaSeq? Can I just proceed with sabre?

UPD.
Disabling mismatches entirely leads to similar results between GBSX and sabre: the demultiplexed files vary less in size (from 300 KB to 1.7 MB). Cutadapt (v3.2, v3.4), meanwhile, produces files of very different sizes (from 10 KB to 10 MB).
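File size is only a rough proxy for read count, especially for compressed output, so comparing reads per sample may be more informative. A minimal sketch, assuming four-line FASTQ records; the function names, the threshold, and the example numbers are hypothetical and not taken from the runs above.

```python
import gzip


def count_reads(path):
    """Return the number of records in a FASTQ file (.gz or plain).

    Assumes strict four-line records.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def compare_tools(counts_by_tool, poor_threshold):
    """Find samples that are 'poor' in some tools but not in all of them.

    counts_by_tool: {tool name: {sample name: read count}}.
    A sample missing from a tool's output is treated as having 0 reads.
    Returns {sample: sorted list of tools in which it falls below the threshold}.
    """
    samples = set()
    for counts in counts_by_tool.values():
        samples.update(counts)

    disagreements = {}
    for sample in sorted(samples):
        poor_in = sorted(
            tool
            for tool, counts in counts_by_tool.items()
            if counts.get(sample, 0) < poor_threshold
        )
        # Keep only samples on which the tools disagree.
        if poor_in and len(poor_in) < len(counts_by_tool):
            disagreements[sample] = poor_in
    return disagreements


# Example usage (made-up numbers, for illustration only):
# counts = {
#     "cutadapt": {"s1": 120, "s2": 50000},
#     "sabre": {"s1": 48000, "s2": 51000},
# }
# print(compare_tools(counts, poor_threshold=1000))
```

The per-tool dictionaries could be filled by calling `count_reads` on each demultiplexed forward file.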
