Demultiplexing mixed orientation reads with qiime cutadapt demux-paired

Hello,

I'm trying to demultiplex a pool of samples that are dual barcoded with mixed orientation reads. The documentation indicates that the command qiime cutadapt demux-paired should be able to handle the reads with the argument --p-mixed-orientation set as true.

Despite this, everything I've read in the forums indicates that this isn't currently possible, but they also aren't very recent posts.

So I gave it a try myself with the following command:

qiime cutadapt demux-paired \
    --i-seqs MMF.16S.330.qza \
    --m-forward-barcodes-file MMf.16S.330_testmapping.tsv \
    --m-forward-barcodes-column FwBC+FwPrimer \
    --m-reverse-barcodes-file MMf.16S.330_testmapping.tsv \
    --m-reverse-barcodes-column RvBC+RvPrimer\
    --p-mixed-orientation TRUE \
    --o-per-sample-sequences demux/per-sample-sequences.qza \
    --o-untrimmed-sequences demux/untrimmed-sequences.qza \
    --verbose > demuxlog.txt\
    --p-cores 16

My metadata sheet had the following format:

#SampleID	BarcodeSequence	PrimerSequence	ReversePrimer	BC+FwPrimer	BC+RvPrimer
Sample1 	Barcode1	FwPrimerSeq	RvPrimerSeq	Barcode1+FwPrimerSeq	Barcode1+RvPrimerSeq
Sample2	Barcode2	FwPrimerSeq	RvPrimerSeq	Barcode2+FwPrimerSeq	Barcode2+RvPrimerSeq
Sample3	Barcode3	FwPrimerSeq	RvPrimerSeq	Barcode3+FwPrimerSeq	Barcode3+RvPrimerSeq

I used the barcodes+primer sequence to create a "unique" barcode for each orientation but when I did the demux I got poor results in the log file:

=== Summary ===

Total read pairs processed:            734,599
  Read 1 with adapter:                 128,404 (17.5%)
  Read 2 with adapter:                 128,404 (17.5%)

== Read fate breakdown ==
Pairs that were too short:                   1 (0.0%)
Pairs discarded as untrimmed:                0 (0.0%)
Pairs written (passing filters):       734,598 (100.0%)

Total basepairs processed:   368,768,698 bp
  Read 1:   184,384,349 bp
  Read 2:   184,384,349 bp
Total written (filtered):    360,557,324 bp (97.8%)
  Read 1:   180,453,622 bp
  Read 2:   180,103,702 bp

=== Summary ===

Total read pairs processed:            606,195
  Read 1 with adapter:                 169,961 (28.0%)
  Read 2 with adapter:                 169,961 (28.0%)

== Read fate breakdown ==
Pairs that were too short:                   2 (0.0%)
Pairs discarded as untrimmed:                0 (0.0%)
Pairs written (passing filters):       606,193 (100.0%)

Total basepairs processed:   304,309,890 bp
  Read 1:   152,154,945 bp
  Read 2:   152,154,945 bp
Total written (filtered):    293,260,925 bp (96.4%)
  Read 1:   146,944,761 bp
  Read 2:   146,316,164 bp

Sorry if this has been asked before (I think it has) I'm just getting mixed signals between the documentation and the forum posts I've read

1 Like

Just a quick update on this, I tried to import read 1 and 2 as one MultiplexedSingleEndBarcodeInSequence archive file each. Then demultiplexed the files from there, using qiime cutadapt demux-single with just the barcode+Primer column. I intended to concatenate these files together to get my full demultiplex.

Unfortunately after this attempt I noticed that my yield ended up being TOO high.

=== Summary R1-BC+FWPrimer ===

Total reads processed:               4,788,429
Reads with adapters:                 2,600,778 (54.3%)


=== Summary R1-BC+RvPrimer ===

Total reads processed:               4,788,429
Reads with adapters:                 2,387,848 (49.9%)

I'm getting over 100% of my input as output! :man_facepalming:

Hello @Che. Mixed orientation reads are fine, and your initial cutadapt command looked good to me. It's dual indexing that is currently a problem, but support for this is being added in QIIME 2 2023.7 IMP: Allow demultiplexing of mixed orientation reads with dual indices (#55) by vaamb · Pull Request #55 · qiime2/q2-cutadapt · GitHub

That's awesome news! I couldn't find any additional information on how to use the dual indexing. I tried the same input as before and more disappointing results.

This is cutadapt 4.4 with Python 3.8.15
Command line parameters: --front file:/tmp/tmpnyfl_gr5 --error-rate 0.1 --minimum-length 1 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-2a8titqr/{name}.1.fastq.gz --untrimmed-output /tmp/q2-MultiplexedPairedEndBarcodeInSequenceDirFmt-uohydxc6/forward.fastq.gz --pair-adapters -G file:/tmp/tmppu_3plye -p /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-2a8titqr/{name}.2.fastq.gz --untrimmed-paired-output /tmp/q2-MultiplexedPairedEndBarcodeInSequenceDirFmt-uohydxc6/reverse.fastq.gz /tmp/qiime2/root/data/64072ddb-8cb5-421d-b151-e7ed052b2c95/data/forward.fastq.gz /tmp/qiime2/root/data/64072ddb-8cb5-421d-b151-e7ed052b2c95/data/reverse.fastq.gz -j 16
Processing paired-end reads on 16 cores ...
Finished in 32.312 s (43.986 µs/read; 1.36 M reads/minute).

=== Summary ===

Total read pairs processed:            734,599
  Read 1 with adapter:                 128,404 (17.5%)
  Read 2 with adapter:                 128,404 (17.5%)

== Read fate breakdown ==
Pairs that were too short:                   1 (0.0%)
Pairs discarded as untrimmed:                0 (0.0%)
Pairs written (passing filters):       734,598 (100.0%)

Total basepairs processed:   368,768,698 bp
  Read 1:   184,384,349 bp
  Read 2:   184,384,349 bp
Total written (filtered):    360,557,324 bp (97.8%)
  Read 1:   180,453,622 bp
  Read 2:   180,103,702 bp

What's flawed about the my approach here?

To be clear, you installed 2023.7 and ran the exact same command as in your first post in 2023.7 correct?

Yeah the exact same input yields the exact same result even after the update. Is there an issue with my barcode metadata? Maybe I should just feed in the barcodes themselves?

Hello @Che,

I used the barcodes+primer sequence to create a "unique" barcode for each orientation

Could you elaborate on this? Why was this necessary? Does this mean that the sequences in these columns are literally the barcode and one of the primers stuck together?

What is the dual indexing scheme that was used? Do you know if it was the typical kind or unique dual indexing?

I'm not sure if this is typical indexing or not but the structure of the indexing is as follows:

This is multiplexed v4-v5 16S amplicon data. For each sample, there is 1 barcode sequence that shared between both the forward and reverse primers added to that samples PCR reaction. So for each sample there will be reads that contain its barcode followed by either the forward or reverse primer that was added to the sample. My metadata has columns for the individual barcodes, primer sequences, and the full barcoded primer of both orientations that was in the PCR reaction.

#SampleID	BarcodeSequence	PrimerSequence	ReversePrimer	BC+FwPrimer	BC+RvPrimer
Sample1 	Barcode1	FwPrimerSeq	RvPrimerSeq	Barcode1+FwPrimerSeq	Barcode1+RvPrimerSeq
Sample2	Barcode2	FwPrimerSeq	RvPrimerSeq	Barcode2+FwPrimerSeq	Barcode2+RvPrimerSeq
Sample3	Barcode3	FwPrimerSeq	RvPrimerSeq	Barcode3+FwPrimerSeq	Barcode3+RvPrimerSeq

This is the command I used:

qiime cutadapt demux-paired \
    --i-seqs MMF.16S.330.qza \
    --m-forward-barcodes-file MMf.16S.330_testmapping.tsv \
    --m-forward-barcodes-column FwBC+FwPrimer \
    --m-reverse-barcodes-file MMf.16S.330_testmapping.tsv \
    --m-reverse-barcodes-column RvBC+RvPrimer\
    --p-mixed-orientation TRUE \
    --o-per-sample-sequences demux/per-sample-sequences.qza \
    --o-untrimmed-sequences demux/untrimmed-sequences.qza \
    --verbose > demuxlog.txt\
    --p-cores 16

The reason I provided the full barcode+primerseq was because when providing the same barcode for the forward and reverse reads i got an error about 1 barcode being referenced multiple times, even with the --p-mixed-orientation TRUE argument.

I assumed adding the primer sequence would satisfy the requirement for a unique barcode between the forward and reverse reads. This was the only way I got output from the demux command

Hello @Che,

I see. Would you mind providing an excerpt from the R1 and R2 sequencing files, say the first 40 lines or so? If this is not the structure of your data let me know. You might have to extract them first--use gzip -cd <file> | head -40 if you're on linux or Mac. Could you also attach the barcode metadata file?

MMf.16S.330_testmapping.tsv (9.3 KB)
330_R2_test.fastq (5.6 KB)
330_R1_test.fastq (5.6 KB)

I've attached some test excerpt files for you to take a look at, hopefully it captures the information you need.

Hello @Che,

What is your understanding of "mixed orientation"? I've seen it defined as different things in different places, and want to make sure we're on the same page. The cutadapt command that you're using defines it as "when forward and reverse reads coexist in the same file". If this is the definition you're using, what is the evidence that this is the case in your situation? Have you found the same primer in both R1 and R2 files?

I wonder if your attempt at two rounds of single-end demuxing double-counted because of the shared barcode sub-sequence in both of the barcode-primer constructs. Perhaps lowering the --p-error-rate parameter would help with specificity. It seems like your results were only a little off there.

I'm also confused why you have barcodes in-sequence and (different) barcodes in your headers. Is there any chance that your sequences were barcoded again during library prep?

Ah I did leave that some details I suppose. This pooled16S data was multiplexed with other libraries, so there was an additional index added to the sample. This was the data that belonged to us, but there is still more demultiplexing necessary to get down to the sample level for analysis.

The reason we know that the barcodes are present in both reads is that in our 16S PCR reaction both the forward and reverse primers are barcoded.

If do 2 single end demultiplex runs on R1 using the barcode combined with the forward and reverse primers I get the following:

=== Summary R1-BC+FWPrimer ===

Total reads processed:               4,788,429
Reads with adapters:                 2,600,778 (54.3%)


=== Summary R1-BC+RvPrimer ===

Total reads processed:               4,788,429
Reads with adapters:                 2,387,848 (49.9%)

So I'm definitely using mixed orientations in the way you described them.
I would combine the single end demux results if I didn't get more than 100% total output so I'll be giving your suggest a try soon and report back

Hello @Che,

If possible, could show the error you got when providing the same sequence for the forward and reverse barcodes?

Another option you have is to use demux-paired and only pass the forward barcode file and column.

Here's my output with just the barcode sequence for both reads:

qiime cutadapt demux-paired \
    --i-seqs MMF.16S.330.qza \
    --m-forward-barcodes-file MMf.16S.330_testmapping.tsv \
    --m-forward-barcodes-column BarcodeSequence \
    --m-reverse-barcodes-file MMf.16S.330_testmapping.tsv \
    --m-reverse-barcodes-column BarcodeSequence \
    --p-mixed-orientation TRUE \
    --o-per-sample-sequences demux/per-sample-sequences.qza \
    --o-untrimmed-sequences demux/untrimmed-sequences.qza \
    --verbose > demuxlog.txt \
    --p-cores 16


/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_cutadapt/_demux.py:183: 
UserWarning: The following samples are using identical barcode for forward and reverse. 
Your resulting sequences might have sequences both in their forward and reverse form 
(you might use the vsearch plugin and perform a de novo clustering with an identity 
threshold of '1' and the strand parameter set to 'both' to merge such sequences together):  ...
warnings.warn("The following samples are using identical barcode "
Traceback (most recent call last):
  File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/q2cli/commands.py", line 478, in __call__
    results = self._execute_action(
  File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/q2cli/commands.py", line 539, in _execute_action
    results = action(**arguments)
  File "<decorator-gen-67>", line 2, in demux_paired
  File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/action.py", line 342, in bound_callable
    outputs = self._callable_executor_(
  File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/action.py", line 566, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_cutadapt/_demux.py", line 273, in demux_paired
    untrimmed = _demux(
  File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_cutadapt/_demux.py", line 224, in _demux
    run_command(cmd)
  File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_cutadapt/_demux.py", line 38, in run_command
    subprocess.run(cmd, check=True)
  File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError:

It never makes it far enough to produce an output

Hello @Che,

Is what you posted the entirety of the demuxlog.txt file?

Any update on lowering the --p-error-rate parameter?

Also, I think it's worth giving what I suggested above a try:

Another option you have is to use demux-paired and only pass the forward barcode file and column.

  1. Here is the entirety of the demuxlog.txt. Early i just gave you the terminal output
This is cutadapt 4.4 with Python 3.8.15
Command line parameters: --front file:/tmp/tmphiedznvy --error-rate 0.1 --minimum-length 1 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-gyiwkabi/{name}.1.fastq.gz --untrimmed-output /tmp/q2-MultiplexedPairedEndBarcodeInSequenceDirFmt-bqmj6yt0/forward.fastq.gz --pair-adapters -G file:/tmp/tmp8ax935mx -p /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-gyiwkabi/{name}.2.fastq.gz --untrimmed-paired-output /tmp/q2-MultiplexedPairedEndBarcodeInSequenceDirFmt-bqmj6yt0/reverse.fastq.gz /tmp/qiime2/root/data/64072ddb-8cb5-421d-b151-e7ed052b2c95/data/forward.fastq.gz /tmp/qiime2/root/data/64072ddb-8cb5-421d-b151-e7ed052b2c95/data/reverse.fastq.gz -j 16
ERROR: Character '2' in adapter sequence 'CC238           CGTCCGTATGAAQD4749          AACACTCGATCGQD4750          GTGGTGGTTTCCCC222           GTTGACCATCGCCC223           TGACTGCGTTAG' is not a valid IUPAC code. Use only characters 'ABCDGHIKMNRSTUVWXY'.
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: cutadapt --front file:/tmp/tmphiedznvy --error-rate 0.1 --minimum-length 1 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-gyiwkabi/{name}.1.fastq.gz --untrimmed-output /tmp/q2-MultiplexedPairedEndBarcodeInSequenceDirFmt-bqmj6yt0/forward.fastq.gz --pair-adapters -G file:/tmp/tmp8ax935mx -p /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-gyiwkabi/{name}.2.fastq.gz --untrimmed-paired-output /tmp/q2-MultiplexedPairedEndBarcodeInSequenceDirFmt-bqmj6yt0/reverse.fastq.gz /tmp/qiime2/root/data/64072ddb-8cb5-421d-b151-e7ed052b2c95/data/forward.fastq.gz /tmp/qiime2/root/data/64072ddb-8cb5-421d-b151-e7ed052b2c95/data/reverse.fastq.gz -j 16

  1. I went down

--p-error-rate .01 \

R1

=== Summary ===

Total reads processed:               4,788,429
Reads with adapters:                 2,545,045 (53.1%)

== Read fate breakdown ==
Reads that were too short:                   1 (0.0%)
Reads discarded as untrimmed:                0 (0.0%)
Reads written (passing filters):     4,788,428 (100.0%)

Total basepairs processed: 1,201,895,679 bp
Total written (filtered):  1,129,988,947 bp (94.0%)

R2

=== Summary ===

Total reads processed:               4,788,429
Reads with adapters:                 2,329,012 (48.6%)

== Read fate breakdown ==
Reads that were too short:                   1 (0.0%)
Reads discarded as untrimmed:                0 (0.0%)
Reads written (passing filters):     4,788,428 (100.0%)

Total basepairs processed: 1,201,895,679 bp
Total written (filtered):  1,125,246,293 bp (93.6%)
  1. For this suggest

If I do this I won't be able to demux my samples from R2 though. Since this will only check the R1 file from what I can tell

Hello @Che,

If I do this I won't be able to demux my samples from R2 though. Since this will only check the R1 file from what I can tell

No, I don't believe so. This is analogous to demultiplexing single-barcoded paired-end libraries which is done all the time. The software knows the forward/reverse read linkage based on information that is not the barcode sequences. The only complication here is the mixed orientation, although it is still not completely clear to me how that is coming into play. So give this a whirl and see what happens.

If I only pass the barcode with the primers I get the following:

qiime cutadapt demux-paired \
    --i-seqs MMF.16S.330.qza \
    --m-forward-barcodes-file MMf.16S.330_testmapping.tsv \
    --m-forward-barcodes-column BC_FwPrimer \
    --p-mixed-orientation TRUE \
    --o-per-sample-sequences demux/per-sample-sequences.qza \
    --o-untrimmed-sequences demux/untrimmed-sequences.qza \

Done           00:00:25       734,599 reads @  34.3 µs/read;   1.75 M reads/minute
Done           00:00:21       538,204 reads @  39.8 µs/read;   1.51 M reads/minute

=== Summary ===

Total read pairs processed:            734,599
  Read 1 with adapter:                 196,395 (26.7%)

== Read fate breakdown ==
Pairs that were too short:                   1 (0.0%)
Pairs discarded as untrimmed:                0 (0.0%)
Pairs written (passing filters):       734,598 (100.0%)

Total basepairs processed:   368,768,698 bp
  Read 1:   184,384,349 bp
  Read 2:   184,384,349 bp
Total written (filtered):    363,867,218 bp (98.7%)
  Read 1:   179,483,120 bp
  Read 2:   184,384,098 bp

If I just pass the barcode alone:

qiime cutadapt demux-paired \
    --i-seqs MMF.16S.330.qza \
    --m-forward-barcodes-file MMf.16S.330_testmapping.tsv \
    --m-forward-barcodes-column BarcodeSequence \
    --p-mixed-orientation TRUE \
    --o-per-sample-sequences demux/per-sample-sequences.qza \
    --o-untrimmed-sequences demux/untrimmed-sequences.qza \

Done           00:00:12       734,599 reads @  17.7 µs/read;   3.39 M reads/minute
Done           00:00:05        70,731 reads @  72.0 µs/read;   0.83 M reads/minute

=== Summary ===

Total read pairs processed:            734,599
  Read 1 with adapter:                 663,868 (90.4%)

== Read fate breakdown ==
Pairs that were too short:                 177 (0.0%)
Pairs discarded as untrimmed:                0 (0.0%)
Pairs written (passing filters):       734,422 (100.0%)

Total basepairs processed:   368,768,698 bp
  Read 1:   184,384,349 bp
  Read 2:   184,384,349 bp
Total written (filtered):    357,815,027 bp (97.0%)
  Read 1:   173,475,105 bp
  Read 2:   184,339,922 bp

It's definitely still missing reads. Would there be a problem with just merging the single read demux results from the same sample together? Is there a particular way i need to approach that so I can continue using Qiime for the downstream analysis after the demux?