Barcode Errors Impair Demultiplexing with Cutadapt?

I hope this question can help others working with paired-end sequences that have forward.fastq.gz, reverse.fastq.gz, and a metadata table containing a column with barcodes. My main problem is that the cutadapt demultiplexing identifies very few sequences per sample. I believe this is due to a barcode issue.

My methods:
To import the data into qiime2 I followed this qiime2 tutorial.

qiime tools import \
  --type MultiplexedPairedEndBarcodeInSequence \
  --input-path sequences \
  --output-path multiplexed-seqs.qza

To demultiplex the data I used the cutadapt command:

qiime cutadapt demux-paired \
--i-seqs multiplexed-seqs.qza \
--m-forward-barcodes-file metadata.tsv \
--m-forward-barcodes-column barcode-sequence \
--p-error-rate 0.1 \
--output-dir demultiplexed_data

To summarize the data I followed the basic code:

qiime demux summarize \
--i-data demultiplexed_data/per_sample_sequences.qza \
--o-visualization demultiplexed_data.qzv

The problem I encountered was that almost no reads were found for some samples.

I used a simple grep search of the unzipped forward reads (gunzip forward.fastq.gz) and found that there were a lot of errors in the barcodes themselves. For example, all barcodes started with "TAAGGCGA" followed by 8 unique nucleotides. I used the grep command grep -o TAAGGCGA........ forward.fastq | sort | uniq and identified 163 unique variations of the barcodes. None of these extracted barcodes exactly matched the barcodes I supplied in my metadata.tsv file.
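The grep check above lists the unique variants but not how often each one occurs. A minimal Python sketch of the same idea with counts is below. It is an illustration, not part of QIIME 2 or cutadapt: the function name `count_barcode_variants` and the example reads are made up, and it assumes plain four-line FASTQ records. Unlike the grep command, it only looks at sequence lines and only accepts A, C, G, T, or N after the prefix.

```python
import re
from collections import Counter


def count_barcode_variants(fastq_lines, prefix="TAAGGCGA", tail_len=8):
    """Count every prefix + tail string found in the sequence lines.

    fastq_lines: any iterable of text lines in 4-line FASTQ layout.
    prefix:      the constant stretch seen in front of the barcodes.
    tail_len:    number of bases to capture after the prefix.
    """
    pattern = re.compile(re.escape(prefix) + "[ACGTN]{%d}" % tail_len)
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # second line of each record holds the bases
            counts.update(pattern.findall(line.strip().upper()))
    return counts


# Made-up reads for illustration only (not real data).
example = [
    "@read1", "TAAGGCGAACGTACGTTTTTGGGG", "+", "I" * 24,
    "@read2", "TAAGGCGAACGTACGTCCCCAAAA", "+", "I" * 24,
    "@read3", "TAAGGCGAACGTACGATTTTGGGG", "+", "I" * 24,
]
variants = count_barcode_variants(example)
for barcode, n in variants.most_common():
    print(barcode, n)
```

On real data the same function could be fed an open file handle, for example one from `gzip.open("forward.fastq.gz", "rt")`. If a few variants dominate and the rest are rare, the rare ones are more likely sequencing errors than distinct barcodes.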

I increased the number of sequences identified for each sample by allowing more errors in the barcode matching with the following command:

qiime cutadapt demux-paired \
--i-seqs multiplexed-seqs.qza \
--m-forward-barcodes-file metadata.tsv \
--m-forward-barcodes-column barcode-sequence \
--p-error-rate 0.2 \
--output-dir demultiplexed_data_error.2

This identified more sequences. However, I am left feeling unsure about the demultiplexing process.

Am I right in assuming that my main issue stems from the barcodes?
Is it normal to not have a perfect match for the barcodes?
Or is this user error?

I am using qiime2-2021.4 on a Windows 10 computer through the Windows Subsystem for Linux. I have used this computer to analyze microbiome data previously, so I don't think it is my hardware.

Thank you for your time and support.

Hi @Stephan_Bitterwolf, thanks for reaching out!

Based on the test you ran by increasing the error rate, my assumption here would be that the main issue is caused by your barcodes - but I will tag a couple of folks on this to see what additional insight we can provide here. Thanks for your patience!

Hi @Stephan_Bitterwolf,

Thanks again for your patience! Here are some thoughts on this:

It's hard to tell exactly what the issue is here without knowing the full set up of your experiments or looking at the raw reads/barcodes, but the most likely culprits are:

  • The presence of other non-biological nts before your barcodes (which you have already said you noticed). You can add those to the barcode columns so they match exactly.
  • Your barcodes may be in some other orientation - try checking whether the complement or reverse complement of the barcodes matches anything in the reads.
  • The reads may be in mixed orientation, which might be workable with the --p-mixed-orientation parameter.
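The orientation check suggested above could be sketched in Python roughly as follows. This is only an illustration: the helper names `orientations` and `count_orientation_hits`, the example barcode, and the example reads are all hypothetical, and the `window` argument (how many leading bases of each read to search) is an assumption added to limit chance matches of a short barcode deep inside a read.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def orientations(barcode):
    """Return the barcode in the four orientations worth testing."""
    b = barcode.upper()
    comp = b.translate(_COMPLEMENT)
    return {
        "as_given": b,
        "reverse": b[::-1],
        "complement": comp,
        "reverse_complement": comp[::-1],
    }


def count_orientation_hits(barcode, reads, window=None):
    """Count reads containing each orientation within the first `window` bases.

    window=None searches the whole read.
    """
    hits = {}
    for name, candidate in orientations(barcode).items():
        hits[name] = sum(candidate in read.upper()[:window] for read in reads)
    return hits


# Made-up barcode and reads for illustration only.
example_reads = [
    "TAAGGCGATACCGGTTGGGGCCCC",
    "TAAGGCGATACCGGTTAAAATTTT",
    "TAAGGCGACCCCCCCCAAAATTTT",
]
hits = count_orientation_hits("AACCGGTA", example_reads, window=16)
print(hits)
```

If one orientation collects far more hits than the others across most of the barcodes in the metadata file, that orientation is probably the one the barcode column should contain.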

Overall though, we wouldn't recommend increasing the error rate; that starts to enter some shaky territory. q2-demux does support error correction, but it is designed for 12 nt Golay barcodes, not 8 nt Hamming barcodes. For 8 nt correction you'll have to work with something outside of QIIME 2. QIIME 1 should be able to handle that, which may be a good next step for you to look into.

Hope this helps!

