I hope this question can help others working with paired-end sequences that have forward.fastq.gz, reverse.fastq.gz, and a metadata table containing a column with barcodes. My main problem is that the cutadapt demultiplexing identifies very few sequences per sample. I believe this is due to a barcode issue.
My methods:
To import the data into qiime2 I followed this qiime2 tutorial.
qiime tools import \
--type MultiplexedPairedEndBarcodeInSequence \
--input-path sequences \
--output-path multiplexed-seqs.qza
To demultiplex the data I used the cutadapt command:
qiime cutadapt demux-paired \
--i-seqs multiplexed-seqs.qza \
--m-forward-barcodes-file metadata.tsv \
--m-forward-barcodes-column barcode-sequence \
--p-error-rate 0.1 \
--output-dir demultiplexed_data
To summarize the data I used the standard summarize command:
qiime demux summarize \
--i-data demultiplexed_data/per_sample_sequences.qza \
--o-visualization demultiplexed_data.qzv
The problem I encountered was that almost no reads were found for some samples.
I ran a simple grep search on the unzipped forward reads (gunzip forward.fastq.gz) and found that there were a lot of errors in the barcodes themselves. For example, all barcodes started with "TAAGGCGA" followed by 8 unique nucleotides. I used the grep command:
grep -o TAAGGCGA........ forward.fastq | sort | uniq
This identified 163 unique variations of the barcodes. None of these extracted barcodes exactly matched the barcodes I supplied in my metadata.tsv file.
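For anyone repeating this check, here is a slightly more careful version of the grep above, as a rough sketch rather than a definitive method. It looks only at the sequence line of each FASTQ record (so quality strings cannot produce false matches) and counts how often each candidate barcode occurs, which helps separate real barcodes from rare sequencing errors. It assumes the barcode sits at the very start of the forward read and is 16 nt long (the 8 shared plus 8 unique bases I observed). The function name tally_barcodes is my own and not part of any tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally the first N bases of every read in a FASTQ file.
# Assumptions: standard 4-line FASTQ records, barcode at the start of the read.
tally_barcodes() {
    local fastq="$1"          # path to a FASTQ file, gzipped or plain
    local length="${2:-16}"   # assumed barcode length (default 16 nt)

    # Decompress only when the file name ends in .gz
    case "$fastq" in
        *.gz) gzip -cd "$fastq" ;;
        *)    cat "$fastq" ;;
    esac \
        | awk -v n="$length" 'NR % 4 == 2 { print substr($0, 1, n) }' \
        | sort | uniq -c | sort -k1,1nr   # most frequent prefixes first
}
```

Running tally_barcodes forward.fastq.gz 16 | head -n 20 should list the most abundant read prefixes with their counts, which can then be compared by eye against the barcode-sequence column in metadata.tsv.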
I increased the number of sequences identified for each sample by allowing more errors in the barcode matching with the following command:
qiime cutadapt demux-paired \
--i-seqs multiplexed-seqs.qza \
--m-forward-barcodes-file metadata.tsv \
--m-forward-barcodes-column barcode-sequence \
--p-error-rate 0.2 \
--output-dir demultiplexed_data_error.2
This identified more sequences. However, I am left feeling unsure about the demultiplexing process.
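To put the two error rates in perspective, here is a small sketch of what they mean in absolute mismatches. As I understand the cutadapt documentation, the number of allowed errors is the barcode length multiplied by the error rate, rounded down. The 16 nt length is my assumption based on the grep result, and allowed_errors is a made-up helper name, not a cutadapt or QIIME 2 command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: maximum errors tolerated for a full-length barcode
# match, assuming the rule floor(barcode_length * error_rate).
allowed_errors() {
    local length="$1" rate="$2"
    awk -v len="$length" -v rate="$rate" 'BEGIN { print int(len * rate) }'
}

# Compare the two settings I tried, for an assumed 16 nt barcode
for rate in 0.1 0.2; do
    echo "16 nt barcode, error rate $rate: up to $(allowed_errors 16 "$rate") error(s)"
done
```

Under these assumptions, going from 0.1 to 0.2 raises the tolerance from 1 to 3 errors per barcode. That would explain why more reads were assigned, but it also increases the chance of assigning a read to the wrong sample if two barcodes are similar.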
Questions:
Am I right in assuming that my main issue stems from the barcodes?
Is it normal to not have a perfect match for the barcodes?
Or is this user error?
I am using qiime2-2021.4 on a Windows 10 computer with the Windows Subsystem for Linux. I have used this computer to analyze microbiome data previously, so I don't think the problem is my hardware.
Thank you for your time and support.