Dual-barcode mix-orientation paired-end reads

foreverwander · July 15, 2021, 2:52am

Pardon my terminology as I am brand new to the microbiome sequencing bioinformatics space. Any help / advice is appreicated!

I have 16S V4-V5 sequencing amplified with the 563F and 926R primers, from MiSeq. I received paired-end multiplexed FASTQ files (R1.fastq.gz and R2.fastq.gz), as well as a file containing primer and barcode mapping for each sample. Example as such:

My understanding so far is that, AYTGGGYDTAAAGNG is the forward primer and CCGTCAATTYHTTTRAGT is the reverse primer. The data also seems to be "dual-barcoded", meaning that a barcode sequence exists in both the R1 and R2 sequence. Here's an example:

R1 (note line number precedes FASTQ content):
201-@M00430:5389:000000000-J2L69:1:2106:19124:2126 1:N:0:ATAGTCTG+GCATTGCA
202:ACGCAATGTACGCAATGTCTGATTGGGTGTAAAGTG...read continues to length of 251

R2 (line number in front):
201-@M00430:5389:000000000-J2L69:1:2106:19124:2126 2:N:0:ATAGTCTG+GCATTGCA
202:CACGCAATGTCTGCCGTCAATTTTTTTTAGT...read continues to length of 251

So I am dealing with "barcode in-sequence" data, and the primer seems to be immediately following the barcode, so far so good.

And then I realized that for the same adaptor, I'm seeing reads in R1 and correspondingly in R2 that have two sets of different lengths of sequences preceding them. So for the same adaptor as above I also have reads like:

R1 (line number precedes FASTQ content):
337-@M00430:5389:000000000-J2L69:1:2106:12161:2165 1:N:0:ATAGTCTG+GCATTGCA
338:AACGCAATGTCTGCCGTCAATTCCTTTAAGT...read continues to length of 251

R2 (line number):
337-@M00430:5389:000000000-J2L69:1:2106:12161:2165 2:N:0:ATAGTCTG+GCATTGCA
338:CCGCAATGTACGCAATGTCTGACTGGGTTTAAAGGG...read continues to length of 251

Since I'm seeing barcodes after the adaptors and what should be the reverse primer in R1 and the forward primer in R2 (all NOT in reverse complement), I believe this means that I have "mixed-orientation reads" as well...

Taken together, I believe I have multiplexed paired-end dual-barcode reads with barcode-in-sequence and mixed-orientation, is that correct?

I was hoping that the --p-mixed-orientation option in qiime cutadapt demux-paired could solve my problems but got the error that Dual-indexed barcodes for mixed orientation reads are not supported. so now am I at a lost again...

I will be going through the rest of the forum trying to comb through terminology and find a solution. Apologies in advance that this might be a duplicated question, but even if I can get confirmation that indeed my reads are what I have diagnosed them to be that would be fantastic.

Thank you!

foreverwander · July 15, 2021, 4:01pm

Just to follow-up with this post and document what I've done.

After setting up the qiime object:

qiime tools import \
  --type MultiplexedPairedEndBarcodeInSequence \
  --input-path qiime2-2021.4/fastqgz/ \
  --output-path qiime2-2021.4/1038_IGO_10763_2_S2_paired.qza

I tried both demux using only forward barcode:

qiime cutadapt demux-paired \
  --i-seqs qiime2-2021.4/1038_IGO_10763_2_S2_paired.qza \
  --m-forward-barcodes-file meta/sample_barcode_barcode_alphanumeric_header.txt \
  --m-forward-barcodes-column BarcodeForward \
  --o-per-sample-sequences qiime2-2021.4/1038_IGO_10763_2_S2_paired_demux_fonly.qza \
  --o-untrimmed-sequences qiime2-2021.4/1038_IGO_10763_2_S2_paired_demux_fonly_unmatched.qza \
  --verbose > 2021-07-15_qiime_cutadapt_demux-paired_forwardbarcodesonly.log

Getting this summary:

=== Summary ===

Total read pairs processed: 3,531,111
Read 1 with adapter: 3,483,369 (98.6%)
Read 2 with adapter: 0 (0.0%)
Pairs that were too short: 1 (0.0%)
Pairs written (passing filters): 3,531,110 (100.0%)

Total basepairs processed: 1,772,617,722 bp
Read 1: 886,308,861 bp
Read 2: 886,308,861 bp
Total written (filtered): 1,717,728,512 bp (96.9%)
Read 1: 831,419,902 bp
Read 2: 886,308,610 bp

And demux-paired using both forward and reverse barcode (which are the same):

qiime cutadapt demux-paired \
  --i-seqs qiime2-2021.4/1038_IGO_10763_2_S2_paired.qza \
  --m-forward-barcodes-file meta/sample_barcode_barcode_alphanumeric_header.txt \
  --m-forward-barcodes-column BarcodeForward \
  --m-reverse-barcodes-file meta/sample_barcode_barcode_alphanumeric_header.txt \
  --m-reverse-barcodes-column BarcodeReverse \
  --o-per-sample-sequences qiime2-2021.4/1038_IGO_10763_2_S2_paired_demux_fb.qza \
  --o-untrimmed-sequences qiime2-2021.4/1038_IGO_10763_2_S2_paired_demux_fb_unmatched.qza \
  --verbose > 2021-06-07_qiime_cutadapt_demux-paired_forwardbackwardbarcodes.log

Getting this summary:

=== Summary ===

Total read pairs processed: 3,531,111
Read 1 with adapter: 2,957,894 (83.8%)
Read 2 with adapter: 2,957,894 (83.8%)
Pairs that were too short: 0 (0.0%)
Pairs written (passing filters): 3,531,111 (100.0%)

Total basepairs processed: 1,772,617,722 bp
Read 1: 886,308,861 bp
Read 2: 886,308,861 bp
Total written (filtered): 1,677,720,520 bp (94.6%)
Read 1: 838,737,445 bp
Read 2: 838,983,075 bp

I was really convinced that inputting both forward and reverse barcodes was the way to go. Until I ran qiime demux summarize and qimme tools view and realized that for one of my samples, using only forward barcodes recovered about 65k reads, but using both forward and backward barcodes recovered only 26 reads...

I don't know if this is also because of the mix-orientation problem in my R1 and R2. But it does seem to be the case that for this particular sample, the adaptor GCTTGAGCTTGA is not present in matching reads in R1 and R2, unlike the example in my main post above...

For example, R1

677-@M00430:5389:000000000-J2L69:1:2106:15574:2235 1:N:0:ATAGTCTG+GCATTGCA
678:GGCTTGAGCTTGAACTGGGTGTAAAGTGCGTGTAGCCGGGAAGGCAAGTCAGATGTGAAATCCACGGGCTCAACTCGTGAACTGCATTTGAAACTGTTTTTCTTGAGTATCGGAGAGGCAATCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGATTGCTGGACGACAACTGACGGTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGG

R1:

677:@M00430:5389:000000000-J2L69:1:2106:15574:2235 2:N:0:ATAGTCTG+GCATTGCA
678-ACATGCATGCCCCGTCAATTTCTTTGAGTTTCAACCTTGCGATCGTACTCCCCAGGTGGAATACTTATTGTGTTAACTGCGGCACGCAGGGGGTCAGTCCCCGCACACCTAGTATTCATCGTTTACATCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCGCCTCACCGTCCGTTGTCGTCCAGCAATCCGCCTTCGCCACTGTTGTTCCTCCTAATCTCTACGCATTTCACCGCC

Would really appreciate any pointers on further diagnostics and how to move forward!

lizgehret · July 20, 2021, 4:12pm

Hi @foreverwander, welcome to the :qiime2: forum!

I'm checking in with a few folks on this one, and will circle back with you soon! Thanks for your patience!

lizgehret · July 21, 2021, 10:04pm

Hi @foreverwander, thanks again for your patience here!

So there are a couple of unusual things going on here:

The same barcode is being used for your forward and reverse primer - dual index barcodes are usually utilizing 2 different barcodes to increase multiplexing. When you de-multiplex with just your forward reads everything seems to work out, but when you add in the reverse barcodes is where we start seeing some issues. Something fishy might be going on with your barcode sequences in your reverse barcode file.

Regardless, if the same barcode is being used both on the forward and reverse reads, you could just use the forward only reads to de-multiplex since that has good results (which you have already done). Then in a separate step you can simply use cutadapt trim-paired to remove the primers from both reads which will also trim off the remaining barcodes/adapters.

It also appears as though the barcode sequences in your .fastq header say that your sequences have already been demultiplexed.

@Mehrbod_Estaki and @llenzi both expressed that they would like to see more than one line for the barcode mapping file, just to confirm that the same barcode was used in R1 and R2. This would be good, because q2-demux cannot be used in any other case.

Based on the fact that your barcodes are 12 bp in length, it appears as though they are golay barcodes - in which case, q2-demux emp-paired may be a better method in this case. Please take a look at this forum post for additional details:

@thermokarst was suggesting to add -p-rev-comp-barcodes as well as --p-rev-comp-mapping-barcodes in this case because the orientation is important for golay barcodes, which may explain the confusion on direction above.

These are all guesses here, so we may be off-track on this - but as a last resort I'd recommend taking these findings back to your sequencing provider to see if they can provide any additional insight on this. Hopefully this helps you with some next steps!

Cheers,
Liz

foreverwander · July 22, 2021, 2:48pm

Thank you @lizgehret for looking into this!!

From the barcode oligos file I received, it does look like the forward and reverse barcodes are the same. I've been able to get a confirmation from the sequencing core that barcodes are supposed to be in both the forward and reverse reads. They core also confirmed that they are 12-bp Golay barcodes.

I also got another piece of information from the core which seems to suggest that since barcodes are expected on both reads, for read-pairs where the barcode is only on one read, they are "confused how this could happen". I will follow up and double check if this suggests a library prep failure.

Could you tell me a little bit more about "barcode sequences in my .fastq header"? I can confirm that my samples are definitely not demultiplexed because I only have a single R1 and R2 for 39 samples But it would be nice to understand the header better.

I will look into the suggestion of using q2-demux emp-paired, thank you very much! I avoided looking into the EMP format as I believed those were files that are not "barcode in-sequence" like mine, but I'll try anything I can get.

Here's the full barcode file for your information:

primer	AYTGGGYDTAAAGNG	CCGTCAATTYHTTTRAGT	
primer	CCGTCAATTYHTTTRAGT	AYTGGGYDTAAAGNG	
BARCODE	GGTATGGCTACT	GGTATGGCTACT	S1
BARCODE	CAACACATGCTG	CAACACATGCTG	S2
BARCODE	GCTTGAGCTTGA	GCTTGAGCTTGA	S3
BARCODE	TCGGCTTGGAAT	TCGGCTTGGAAT	S4
BARCODE	CTATGAGTCCAG	CTATGAGTCCAG	S5
BARCODE	GGTCACACATCA	GGTCACACATCA	S6
BARCODE	AACTCGCGCTAC	AACTCGCGCTAC	S7
BARCODE	TAGGTCTAGGTC	TAGGTCTAGGTC	S8
BARCODE	GCACATAGTCGT	GCACATAGTCGT	S9
BARCODE	ATCTTGGAGTCG	ATCTTGGAGTCG	S10
BARCODE	GAAGGTGAAGGT	GAAGGTGAAGGT	S11
BARCODE	AGTCCACTGGTA	AGTCCACTGGTA	S12
BARCODE	GCCGATTGTAAC	GCCGATTGTAAC	S13
BARCODE	CGTAGCCAACAT	CGTAGCCAACAT	S14
BARCODE	ACGCAATGTCTG	ACGCAATGTCTG	S15
BARCODE	TAATCGGTGCCA	TAATCGGTGCCA	S16
BARCODE	CGAGTCACGATT	CGAGTCACGATT	S17
BARCODE	AGGTTCTTAGGC	AGGTTCTTAGGC	S18
BARCODE	CAGAATCGCTCA	CAGAATCGCTCA	S19
BARCODE	TTCGCAGATACG	TTCGCAGATACG	S20
BARCODE	ACTCCGATAGAC	ACTCCGATAGAC	S21
BARCODE	GATAGCACTCGT	GATAGCACTCGT	S22
BARCODE	GTATAGTCCGTG	GTATAGTCCGTG	S23
BARCODE	CGCCTTGATAAG	CGCCTTGATAAG	S24
BARCODE	TCGGATCTGTGA	TCGGATCTGTGA	S25
BARCODE	GCTCGAAGATTC	GCTCGAAGATTC	S26
BARCODE	AGAGCATCCACT	AGAGCATCCACT	S27
BARCODE	TAGATCCTCGGA	TAGATCCTCGGA	S28
BARCODE	CTCTACGAACAG	CTCTACGAACAG	S29
BARCODE	AGTAGGAGGCAC	AGTAGGAGGCAC	S30
BARCODE	CAATTGCGTGCA	CAATTGCGTGCA	S31
BARCODE	TCCAGATAGCGT	TCCAGATAGCGT	S32
BARCODE	ACACCTGCGATC	ACACCTGCGATC	S33
BARCODE	TTGGTCTCCTCT	TTGGTCTCCTCT	S34
BARCODE	GGCTCTAACGTA	GGCTCTAACGTA	S35
BARCODE	GAGCGAGTTAGG	GAGCGAGTTAGG	S36
BARCODE	CTTGTGCGACAA	CTTGTGCGACAA	S37
BARCODE	GGTGACTAGTTC	GGTGACTAGTTC	S38
BARCODE	AACACTCGATCG	AACACTCGATCG	S39
#V4_V5.oligos.barcodes

Thank you very much!!

I'll be back.

Thanks,
Lydia

lizgehret · July 26, 2021, 6:15pm

Hi @foreverwander,

Thanks for following up and providing those details!

That is something that @llenzi had mentioned during my internal discussion, so they may be able to speak to that further!

I think that is a great next step to confirm on your end - that may help provide you with some additional clarification!

Great! Feel free to reach back out if you run into any issues with that or have further questions!

Cheers,
Liz

llenzi · July 27, 2021, 8:28am

Hi @foreverwander

Just to clarify, the fact that in the header of your sequences there are:

suggests me that your sequences were previously demultiplexed and the above barcodes were used (which look like Nextera barcodes, being only 8 bp long).
The simple explanation may be that the sequencing centre do run few project together to optimise the runs, and that is why they have been already demultiplexed. Nothing wrong with this, should not affect your data quality as far I know!

If your provider did the library prep, they should be able to point you toward the correct demultiplexing way too!
Let us know how is going!
Cheers
Luca

system · August 27, 2021, 2:28pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.