Demultiplexing single-end data assigns most reads to one sample

Hi all, I'm new to Qiime2 and bioinformatics in general.

I'm not sure how to troubleshoot this issue! About 75% of the total reads are assigned to a single sample after demultiplexing. I haven't found a post with my exact issue, but maybe I'm using the wrong keywords. I made an earlier post where I originally thought the problem was DADA2, so I thought I should make a new post.

I'm using qiime2-2023.5 in conda. The data is 300 bp single-end multiplexed Illumina MiSeq data with custom barcodes included, forward reads only. The data does not include primers or adapters. I get the same issue with an ITS library and a custom amplicon library. I know there isn't an issue with the run data itself, since this issue didn't arise when we used UPARSE on the same data.

Troubleshooting so far:

  1. Barcodes: I've checked that the barcodes in my mapping file match the barcodes in the original fastq file. (I used grep to search the fastq headers for ':barcode-sequence'; each barcode got thousands of hits, so the mapping file is correct.)

demultiplexed-seqs (3).qzv (303.5 KB)

I'm not sure whether this info also helps explain what happened, but the DADA2 denoising step filters out almost all of my sample reads, except for the one sample with ~10 million reads (~60% pass the filter). For most of the 96 samples, less than 1% of reads pass the filter.

Many thanks in advance for any ideas on how to troubleshoot further!

Hello @ahale004,

> I know there isn't an issue with the run data itself, since this issue didn't arise when we used UPARSE on the same data.

The reads were more evenly spread across samples? Can you show a screenshot or something similar?

> Barcodes: I've checked that the barcodes in my mapping file match the barcodes in the original fastq file. (used grep to search the headers in the fastq with ':barcode-sequence'. Each one got thousands of hits, so the mapping file is correct).

Did the number of hits for a barcode match the number of reads allotted to the sample that that barcode represents?

> The reads were more evenly spread across samples? Can you show a screenshot or something similar?

yes!
UPARSE: (screenshot of per-sample read counts)

Qiime2: (screenshot of per-sample read counts)
> Did the number of hits for a barcode match the number of reads allotted to the sample that that barcode represents?

They don't match!

Grep the 1st sample (F001.155):

```
grep -c ':CTCGACTACTGA' JB155_FC1577L1P1.fastq
# 98,867
```

Demux .qzv read count: F001.155 = 82,066

Grep the sample with the most reads (#32):

```
grep -c ':GCACTGCTGAGA' JB155_FC1577L1P1.fastq
# 217,043
```

Demux .qzv read count: F032.155 = 10,190,080

Hello @ahale004,

You used `qiime cutadapt demux-single`, the help text for which says:

> Demultiplex sequence data (i.e., map barcode reads to sample ids). Barcodes are expected to be located within the sequence data (versus the header, or a separate barcode file).

You said that your barcodes are in the header, so that's probably your problem.

Thanks so much for helping, Colin! I think you're right that I used the wrong demultiplexing option.

I think we found a fix (below with notes), because the reads per barcode now exactly match their counts in the original fastq file.

Notes & code:

  1. I think I misunderstood which demultiplexing command to use, because our sequencer does not send a separate barcodes.fastq file but instead embeds the barcode in one fastq file. We switched from the 'barcodes in sequence' option to the demux emp-single option for demultiplexing.
  2. We also have custom barcodes that may not follow the Earth Microbiome Project format, so we turned off the Golay option.

For anyone who has a similar issue, here are the commands I used:

We originally used:

```
qiime tools import \
  --type MultiplexedSingleEndBarcodeInSequence \
  --input-path JB155_FC1577L1P1.fastq.gz \
  --output-path multiplexed-seqs.qza

qiime cutadapt demux-single \
  --i-seqs multiplexed-seqs.qza \
  --m-barcodes-file JB155_map_k_q.txt \
  --m-barcodes-column BarcodeSequence \
  --p-error-rate 0 \
  --o-per-sample-sequences demultiplexed-seqs.qza \
  --o-untrimmed-sequences untrimmed.qza \
  --verbose
```

The fix: We used a different import and demux command, and turned off the Golay error correction. This import option required us to first extract a barcodes.fastq.gz file from the multiplexed fastq file, which we did with a QIIME 1 command.

```
mkdir emp-single-end-sequences
```

(We needed a command to separate the barcodes.fastq.gz from the fastq.gz file first. Follow the directions for making the directory and file names exactly, with no extra files in the directory.)

```
qiime tools import \
  --type EMPSingleEndSequences \
  --input-path emp-single-end-sequences \
  --output-path emp-single-end-sequences1.qza

qiime demux emp-single \
  --i-seqs emp-single-end-sequences1.qza \
  --m-barcodes-file JB155_map_k.tsv \
  --m-barcodes-column BarcodeSequence \
  --p-no-golay-error-correction \
  --o-per-sample-sequences nogolaydemux.qza \
  --o-error-correction-details nogolaydemux-details.qza \
  --verbose
```

Hello @ahale004,

This is an interesting workaround. I think you're right that QIIME 2 doesn't really support your use case (non-EMP sequences with barcodes in the headers). If this is working for you, then great.

What is the reasoning behind disabling Golay error correction? I believe it just helps account for sequencing errors in the barcode sequences and isn't EMP-specific. It makes sense that your grep counts line up with the Golay-disabled counts, because both are exact matches only. However, you'll probably want to account for the mismatches too.

Hi Colin,
We did try this with Golay on; here's the error it threw:

```
Plugin error from demux:

No sequences were mapped to samples. Check that your barcodes are in the correct orientation (see the rev_comp_barcodes and/or rev_comp_mapping_barcodes options). If barcodes are NOT Golay format set golay_error_correction to False.

See above for debug info.
```

We believe the barcodes are in the correct orientation. So, we think that the barcodes are not in the Golay format required for EMP, since they are also not EMP barcodes.

Hello @ahale004,

Ah, ok. That was a misunderstanding on my part. I thought that Golay was an error-correcting algorithm, not a format.
