Barcode Errors Impair Demultiplexing with Cutadapt?

Stephan_Bitterwolf · July 29, 2021, 4:41pm

I hope this question can help others working with paired-end sequences that have forward.fastq.gz, reverse.fastq.gz, and a metadata table containing a column with barcodes. My main problem is that the cutadapt demultiplexing identifies very few sequences per sample. I believe this is due to a barcode issue.

My methods:
To import the data into qiime2 I followed this qiime2 tutorial.

qiime tools import \
  --type MultiplexedPairedEndBarcodeInSequence \
  --input-path sequences \
  --output-path multiplexed-seqs.qza

To demultiplex the data I used the cutadapt command:

qiime cutadapt demux-paired \
--i-seqs multiplexed-seqs.qza \
--m-forward-barcodes-file metadata.tsv \
--m-forward-barcodes-column barcode-sequence \
--p-error-rate 0.1 \
--output-dir demultiplexed_data

To summarize the data I followed the basic code:

qiime demux summarize \
--i-data demultiplexed_data/per_sample_sequences.qza \
--o-visualization demultiplexed_data.qzv

The problem I encountered was that almost no reads were found for some samples.

I used a simple grep search of the unzipped forward reads (gunzip forward.fastq.gz) and found that there were a lot of errors in the barcodes themselves. For example, all barcodes started with "TAAGGCGA" followed by 8 unique nucleotides. I used the grep command grep -o TAAGGCGA........ forward.fastq | sort | uniq and identified 163 unique variations of the barcodes. None of these extracted barcodes exactly matched the barcodes I supplied in my metadata.tsv file.
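The grep check above lists each variant once, so a barcode seen in thousands of reads looks the same as a one-off sequencing error. Here is a minimal sketch that adds a count to each variant. `tally_barcodes` is a hypothetical helper name, and the sketch assumes a standard four-line FASTQ record layout and the fixed prefix plus 8 nt pattern described above.

```shell
# Tally candidate barcodes: a fixed prefix followed by the next 8 bases.
# FASTQ text is read from stdin. Usage: tally_barcodes PREFIX
# (tally_barcodes is a hypothetical helper, not part of QIIME 2 or cutadapt.)
tally_barcodes() {
  prefix=$1
  awk 'NR % 4 == 2' |               # keep only the sequence line of each record
    grep -o "${prefix}........" |   # prefix plus 8 further characters
    sort | uniq -c | sort -rn       # count each variant, most frequent first
}
```

It could be run as `gzip -cd forward.fastq.gz | tally_barcodes TAAGGCGA | head -n 20`. Variants with high counts are the ones worth comparing against the barcode column in metadata.tsv.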

I increased the number of sequences identified for each sample by allowing more errors in the barcode matching with the following command:

qiime cutadapt demux-paired \
--i-seqs multiplexed-seqs.qza \
--m-forward-barcodes-file metadata.tsv \
--m-forward-barcodes-column barcode-sequence \
--p-error-rate 0.2 \
--output-dir demultiplexed_data_error.2

This identified more sequences. However, I am left feeling unsure about the demultiplexing process.

Questions:
Am I right in assuming that my main issue stems from the barcodes?
Is it normal to not have a perfect match for the barcodes?
Or is this user error?

I am using qiime2-2021.4 on a Windows 10 computer, via the Windows Subsystem for Linux. I have used this computer to analyze microbiome data previously, so I don't think it is my hardware.

Thank you for your time and support.

lizgehret · August 2, 2021, 7:24pm

Hi @Stephan_Bitterwolf, thanks for reaching out!

Based on the test you ran by increasing the error rate, my assumption here would be that the main issue is caused by your barcodes - but I will tag a couple of folks on this to see what additional insight we can provide here. Thanks for your patience!

lizgehret · August 4, 2021, 12:17am

Hi @Stephan_Bitterwolf,

Thanks again for your patience! Here are some thoughts on this:

It's hard to tell exactly what the issue is here without knowing the full setup of your experiments or looking at the raw reads/barcodes, but the most likely culprits are:

The presence of other non-biological nts before your barcodes (which you have already said you noticed). You can add those to the barcode columns so they match exactly.
Your barcodes may be in some other orientation - try checking whether the complement or the reverse complement of the barcodes matches anything in the reads.
The reads may be in mixed orientation, which might be workable with the --p-mixed-orientation parameter.
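The orientation check suggested above can be sketched with standard tools. The function names `complement`, `revcomp`, and `check_orientations` are hypothetical, and the sketch assumes an uncompressed four-line FASTQ file and that the `rev` utility is available.

```shell
# Complement of a DNA sequence (A<->T, C<->G), order unchanged.
complement() {
  printf '%s\n' "$1" | tr 'ACGTacgt' 'TGCAtgca'
}

# Reverse complement: reverse the sequence, then complement it.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# For one barcode, count the reads that contain it in each orientation.
# Usage: check_orientations BARCODE FASTQ   (FASTQ must be uncompressed)
check_orientations() {
  barcode=$1
  fastq=$2
  for label in as_given complement reverse_complement; do
    case $label in
      as_given)           query=$barcode ;;
      complement)         query=$(complement "$barcode") ;;
      reverse_complement) query=$(revcomp "$barcode") ;;
    esac
    # grep -c exits non-zero on zero matches, so "|| true" keeps going.
    hits=$(awk 'NR % 4 == 2' "$fastq" | grep -c -- "$query" || true)
    printf '%s\t%s\t%s\n' "$label" "$query" "$hits"
  done
}
```

Running `check_orientations` on a few barcodes from the metadata file against forward.fastq, and again against the reverse reads, should show which orientation, if any, is actually present.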

Overall though, we wouldn't recommend increasing the error rate; that starts to enter some shaky territory. q2-demux does support error correction, but it is designed for 12 nt Golay barcodes, not 8 nt Hamming barcodes. For 8 nt correction you'll have to work with something outside of QIIME 2. QIIME 1 split_libraries.py should be able to handle that, which may be a good next step for you to look into.
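To make the error-rate concern concrete, here is a small sketch of the arithmetic. It assumes cutadapt's documented rule that the number of allowed errors is the barcode length times the error rate, rounded down. The lengths 8 and 16 are illustrative only, since the thread does not state the length of the barcodes in the metadata file, and `max_errors` and `hamming` are hypothetical helper names.

```shell
# Errors permitted for a barcode of LENGTH at error RATE,
# assuming the floor(length * rate) rule from the cutadapt documentation.
max_errors() {
  # The small epsilon guards against floating-point results like 2.9999999.
  awk -v len="$1" -v rate="$2" 'BEGIN { printf "%d\n", int(len * rate + 1e-9) }'
}

# Number of positions at which two equal-length sequences differ.
hamming() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    if (length(a) != length(b)) { print "NA"; exit }
    d = 0
    for (i = 1; i <= length(a); i++)
      if (substr(a, i, 1) != substr(b, i, 1)) d++
    print d
  }'
}

max_errors 8 0.1    # 0: an 8 nt barcode must match exactly
max_errors 8 0.2    # 1
max_errors 16 0.1   # 1
max_errors 16 0.2   # 3
```

Under that assumption, an 8 nt barcode gets no tolerance at all at 0.1, and raising the rate to 0.2 allows one error. Whether that is safe depends on how different the barcodes are from each other: if `hamming` reports that two barcodes in the metadata differ at only one or two positions, a read with a single error can sit equally close to both and may be assigned to the wrong sample.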

Hope this helps!

Cheers,
Liz
