I am new to QIIME 2 and have some questions about importing sequence data.
I have barcode sequences. How can I prepare or generate a barcodes.fastq.gz file? I know convert_fastaqual_fastq.py in QIIME 1 can convert fasta to fastq, but I do not have a QUAL file for the barcodes.
Before I import data, should I rename the sequence file to “sequences.fastq.gz” and rename the barcode file to “barcodes.fastq.gz”?
If your data is multiplexed, take a look at the “EMP multiplexed format”. This is currently the only multiplexed format supported by QIIME 2, so you’ll need to either get your data into that format or demultiplex it with an external tool and import the demultiplexed data into QIIME 2. Please let me know if you have specific questions about that.
Yes, you’ll need to use those file names when importing. After importing your data, you can name your .qza/.qzv files whatever you’d like (it makes no difference to QIIME 2).
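For what it's worth, here is a minimal Python sketch of the renaming step, assuming you already have two uncompressed fastq files (one with reads, one with barcodes). The function name stage_emp_dir and the example paths are made up for illustration; only the two output file names come from the answer above.

```python
import gzip
import shutil
from pathlib import Path


def stage_emp_dir(seqs_fastq, barcodes_fastq, out_dir):
    """Gzip two plain-text fastq files into out_dir under the file names
    the importer expects (sequences.fastq.gz and barcodes.fastq.gz)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    targets = (
        (seqs_fastq, "sequences.fastq.gz"),
        (barcodes_fastq, "barcodes.fastq.gz"),
    )
    for src, name in targets:
        # Stream the copy so large files are never held in memory.
        with open(src, "rb") as fin, gzip.open(out / name, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    return out
```

A call such as stage_emp_dir("lib1_reads.fastq", "lib1_barcodes.fastq", "emp-single-end-sequences") would leave a directory holding just those two files, which you can then point the import at.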
Thanks for your reply.
My data is still multiplexed. I know the barcode sequences for each sample, but I do not have a barcodes.fastq.gz file. Can you advise how to get a barcodes.fastq.gz file for my barcodes?
What file format are your barcodes in? Can you post an example of what those look like?
Solely knowing what barcode is used for each sample isn’t sufficient to demultiplex your data in QIIME 2. You’ll need to have a .fastq file containing your sequences, and another .fastq file that has each sequenced barcode associated with your sequence data. Each sequence needs to have a corresponding barcode in the barcodes .fastq file so that QIIME 2 will know which sample a sequence belongs to. The sequences and barcodes .fastq files need to have their records in the same order as well.
If the barcodes are contained in your raw sequence data, there currently isn’t a way to extract them into a separate file in QIIME 2. I think QIIME 1’s extract_barcodes.py can do that for you (or another external tool).
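Once you have both files, however you produce them, the same-order requirement is easy to sanity-check yourself. The sketch below is not part of QIIME; it assumes standard 4-line fastq records and that a read and its barcode share the same ID (the header text up to the first space). The names read_ids and first_mismatch are just illustrative.

```python
import gzip
from itertools import zip_longest


def read_ids(path):
    """Yield the ID of every record in a 4-line-per-record fastq file.

    The ID is taken as the header line without the leading '@', cut at
    the first whitespace. Plain and .gz files are both accepted.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:  # header line of each record
                yield line[1:].split()[0]


def first_mismatch(seqs_path, barcodes_path):
    """Return the 0-based index of the first record whose IDs differ
    (including one file running out early), or None if they pair up."""
    pairs = zip_longest(read_ids(seqs_path), read_ids(barcodes_path))
    for n, (seq_id, bc_id) in enumerate(pairs):
        if seq_id != bc_id:
            return n
    return None
```

If first_mismatch("sequences.fastq.gz", "barcodes.fastq.gz") returns None, the two files have the same number of records in the same order.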
I only know what the barcode sequences are (e.g. barcode 1: acagca). I think I should give more details about how we generated the data.
We have 20 tagged primer pairs; each primer sequence contains a 6-base tag (the barcode). We pooled the PCR products of 20 samples (each sample was PCR amplified using a different tagged primer pair) into a library, then we ligated an index to each library using the Illumina TruSeq PCR-Free kit. We then pooled a few libraries for a paired-end MiSeq run.
After the MiSeq run, we obtained two sequence files for each library. In the sequence files, the 5’ ends of the sequences do not contain adapter sequences, but the 3’ ends may still contain adapters. Therefore, I used AdapterRemoval (https://github.com/MikkelSchubert/adapterremoval) to remove adapters and merge the two files. Now I have a merged sequence file for each library, and each library contains sequences from 20 samples. I want to use the 6-base barcodes to demultiplex the 20 samples in each library. I tried to import the data into QIIME 2 following the “EMP protocol” multiplexed single-end fastq tutorial, but I do not know how to get the barcode fastq file.
How do other people get the barcodes.fastq file when they do MiSeq runs? Is there any way that I can use QIIME 2 to analyze my data? If not, I will use QIIME 1.
Your library prep method differs from the EMP protocol (in which the MiSeq produces separate sequence and barcode fastqs), hence the EMP data importing and demux commands will not work. However, I do believe we can get your data into QIIME2 if you follow the steps below.
It sounds like you do have QUAL scores for your barcodes, just not a separate barcode read. Instead, the barcodes should be the first 6 nt at the 5’ end of each read in your merged fastq file. We need to split those out into a separate file, along with the QUAL scores associated with those bases.
A solution already exists for this in QIIME 1: extract_barcodes.py. Use that command to split your merged fastq file into separate barcode and sequence files. Rename these files per convention, as described in @jairideout's answer to your question #2. Then you can import these files into QIIME 2 and process them as if the reads had been prepared using EMP protocols. The one caveat is that q2_demux.emp_single might not support 6 nt barcodes, but I don’t think a barcode length is enforced. @jairideout or @gregcaporaso, do you know?
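To make the idea concrete, here is a rough Python sketch of what the split amounts to. This is not how extract_barcodes.py is implemented, just an illustration under some assumptions: the input is an uncompressed fastq with 4 lines per record, and the barcode is exactly the first 6 bases at the 5’ end of every merged read. It ignores any tag at the 3’ end. The function names and file paths are hypothetical.

```python
def split_barcode(record, bc_len=6):
    """Split one fastq record (header, seq, plus, qual) into a barcode
    record and a trimmed read record that share the same header."""
    header, seq, plus, qual = record
    barcode = (header, seq[:bc_len], plus, qual[:bc_len])
    trimmed = (header, seq[bc_len:], plus, qual[bc_len:])
    return barcode, trimmed


def split_fastq(in_path, barcodes_path, reads_path, bc_len=6):
    """Write the first bc_len bases (and their quality scores) of every
    read to barcodes_path and the remainder to reads_path.

    Records are written in input order, so the two outputs stay paired.
    Returns the number of records processed.
    """
    count = 0
    with open(in_path) as fin, \
            open(barcodes_path, "w") as bc_out, \
            open(reads_path, "w") as rd_out:
        while True:
            record = [fin.readline().rstrip("\n") for _ in range(4)]
            if not record[0]:  # empty header line means end of file
                break
            barcode, trimmed = split_barcode(record, bc_len)
            bc_out.write("\n".join(barcode) + "\n")
            rd_out.write("\n".join(trimmed) + "\n")
            count += 1
    return count
```

For example, split_fastq("library1_merged.fastq", "barcodes.fastq", "sequences.fastq") would produce the two files, which then need to be gzipped and named as discussed above. Reads shorter than 6 bases would end up with a short barcode and an empty sequence, so you may want to filter those out first.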
Just for reference, we have an open issue to add support for importing this type of multiplexed sequence data, where the barcodes are contained within the sequences. We’ll follow up here when it’s available in a QIIME 2 release!