I hope this question can help others working with paired-end sequences that have forward.fastq.gz, reverse.fastq.gz, and a metadata table containing a column with barcodes. My main problem is that the cutadapt demultiplexing identifies very few sequences per sample. I believe this is due to a barcode issue.
To import the data into qiime2 I followed this qiime2 tutorial.
qiime tools import \
  --type MultiplexedPairedEndBarcodeInSequence \
  --input-path sequences \
  --output-path multiplexed-seqs.qza
To demultiplex the data I used the cutadapt command:
qiime cutadapt demux-paired \
  --i-seqs multiplexed-seqs.qza \
  --m-forward-barcodes-file metadata.tsv \
  --m-forward-barcodes-column barcode-sequence \
  --p-error-rate 0.1 \
  --output-dir demultiplexed_data
To summarize the data I followed the basic code:
qiime demux summarize \
  --i-data demultiplexed_data/per_sample_sequences.qza \
  --o-visualization demultiplexed_data.qzv
The problem I encountered was that almost no reads were found for some samples.
I ran a simple grep search on the unzipped forward reads (gunzip forward.fastq.gz) and found that there were a lot of errors in the barcodes themselves. For example, all barcodes started with "TAAGGCGA" followed by 8 unique nucleotides. I used the grep command
grep -o TAAGGCGA........ forward.fastq | sort | uniq
and identified 163 unique variations of the barcodes. None of these extracted barcodes exactly matched the barcodes I supplied in my metadata.tsv file.
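For anyone who wants to repeat this check, here is a rough sketch of the same search that also counts how often each candidate barcode occurs and pulls the barcode column out of the metadata for comparison. The helper names tally_barcodes and metadata_column are made up for illustration and are not part of QIIME 2 or cutadapt. The sketch assumes a gzipped FASTQ with four lines per record and a tab-separated metadata file, and it strips Windows line endings in case the file was edited outside WSL.

```shell
# Print "count<TAB>barcode" for every PREFIX + N-base window found in the
# sequence lines of a gzipped FASTQ file, most frequent first.
# Arguments: $1 = FASTQ.gz file, $2 = fixed prefix, $3 = number of variable bases.
tally_barcodes() {
  gzip -cd "$1" \
    | awk 'NR % 4 == 2' \
    | grep -oE "${2}.{${3}}" \
    | sort | uniq -c \
    | awk '{ print $1 "\t" $2 }' \
    | sort -k1,1nr -k2,2
}

# Print the values of one named column from a tab-separated metadata file,
# skipping the header row and any "#q2:types" directive row.
# Arguments: $1 = metadata file, $2 = column name.
metadata_column() {
  awk -F '\t' -v col="$2" '
    { sub(/\r$/, "") }                      # drop a trailing carriage return
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
    /^#q2:types/ { next }
    c { print $c }
  ' "$1"
}

# Example use with the files from this post (not run here):
#   tally_barcodes forward.fastq.gz TAAGGCGA 8 > observed.tsv
#   metadata_column metadata.tsv barcode-sequence | sort -u > expected.txt
#   cut -f2 observed.tsv | grep -Fx -f expected.txt   # exact matches only
```

If the last command prints nothing, no observed 16-base window is an exact match for any barcode in the metadata, which is the situation described above. The counts in observed.tsv show whether a few variants dominate or whether the reads are spread evenly across all of them.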
I increased the number of sequences identified for each sample by allowing more errors in the barcode matching with the following command:
qiime cutadapt demux-paired \
  --i-seqs multiplexed-seqs.qza \
  --m-forward-barcodes-file metadata.tsv \
  --m-forward-barcodes-column barcode-sequence \
  --p-error-rate 0.2 \
  --output-dir demultiplexed_data_error.2
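As I understand the cutadapt documentation, the number of errors tolerated in a match is the barcode length multiplied by the error rate, rounded down. The sketch below only works through that arithmetic. The max_errors helper is hypothetical, and the lengths of 8 and 16 nt are assumptions (16 is the window covered by the grep pattern above; the barcodes in metadata.tsv may be a different length).

```shell
# Allowed errors = floor(barcode length * error rate).
# Arguments: $1 = barcode length in nt, $2 = error rate.
max_errors() {
  awk -v len="$1" -v rate="$2" 'BEGIN { print int(len * rate) }'
}

# Compare the two error rates used in this post for two assumed lengths.
for len in 8 16; do
  for rate in 0.1 0.2; do
    printf 'length=%s rate=%s allowed_errors=%s\n' \
      "$len" "$rate" "$(max_errors "$len" "$rate")"
  done
done
```

Under these assumptions, a 16 nt barcode goes from 1 tolerated error at 0.1 to 3 at 0.2, while an 8 nt barcode goes from 0 to 1. The more errors are tolerated, the more likely it is that a read gets assigned to the wrong sample when barcodes resemble each other.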
This identified more sequences. However, I am left feeling unsure about the demultiplexing process.
Am I right in assuming that my main issue stems from the barcodes?
Is it normal to not have a perfect match for the barcodes?
Or is this user error?
I am using qiime2-2021.4 on a Windows 10 computer through the Windows Subsystem for Linux. I have used this computer to analyze microbiome data previously, so I don't think the problem is my setup.
Thank you for your time and support.