Metadata for start

Hi Dear Matthew and all

I have worked through most of the online QIIME 2 tutorials, but as a beginner I found it a bit confusing to get started.
I have received the results for 51 samples (bacterial 16S amplicon sequencing).

There are two files for each sample, R1.fastq.gz and R2.fastq.gz, and a PDF listing index 1 and index 2, which I think are the barcodes.
I visited the QIIME 2 website to make a metadata file and also found a tutorial about it, but my samples do not have all of this information: "#SampleID/BarcodeSequence/LinkerPrimerSequence/Treatment/DOB/Description"

  1. Which barcodes do I need to use to make the metadata file?

  2. One column for #SampleID, then two separate columns for index 1 and index 2 in an Excel sheet?
    Based on the information in the PDF, I think the primers have been trimmed off.

  3. After making this file, how do I remove the barcodes from the R1 and R2 files (the barcodes are separated by a + sign)?

  4. Does this process need to be done for each sample separately?

For each sample, when I run zcat fastq.gz | head, I see something like:

@M02501:63:000000000-BT3BH:1:1101:14311:1491 1:N:0: CGTACTAG+TATCCTCT

I am using QIIME 2 in VirtualBox (q2cli version 2018.4.0) on my office computer; at the moment I don't have a server connection.


I really appreciate your help

Hi @Fatemah,
Welcome to :qiime2:!

Sorry you're having a slow start. If you have any recommendations for how the tutorials could be fine-tuned, please do share them here!

From your description it sounds like your sequencing facility has already demultiplexed your sequences for you so you can use the manifest import option to import your reads into qiime2.

Since you already have demultiplexed files, you won't need a barcode column in your metadata file anymore. Just make sure that, in addition to the primers, these barcodes have also been trimmed from your sequences.

You are right, the i7 and i5 indexes have been removed. The facility likely offers you this information just for your records.

A FASTQ file holds 4 lines of information per read.
1st line: @M02501:63:000000000-B... <-- this header line contains info about the machine/run, and in your case it looks as though the barcodes have also been stored there for your reference
2nd line: GTGCCAGCAGCCGCGG.... your actual sequence
3rd line: the + sign is just a placeholder, meaning this line holds no important information, though in theory it could for some other uses.
4th line: CCCCCGGGGGGGGG... these are the quality scores, one for each base of the sequence on line 2.
You can read more here if you want.
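
To make the four-line layout concrete, here is a minimal Python sketch that splits one record into its parts and pulls the index pair out of the header. It is only an illustration: the function name `parse_fastq_record` is made up for this post, the read and quality strings are shortened, made-up stand-ins (only the header is the one you posted), and it assumes the indexes sit in the last colon-separated field of the header, joined by a + sign.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into header, sequence, and quality parts."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("these four lines do not look like a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lines must have the same length")

    # Everything after the first space is the header comment, e.g. "1:N:0: CGTACTAG+TATCCTCT".
    read_id, _, comment = header[1:].partition(" ")
    # Assumption: the index pair is the last colon-separated field of that comment.
    index_field = comment.rsplit(":", 1)[-1].strip() if comment else ""
    indexes = index_field.split("+") if index_field else []

    return {
        "read_id": read_id,
        "indexes": indexes,
        "sequence": sequence,
        "quality": quality,
        # Rough check only: does the read itself begin with one of the indexes?
        "starts_with_index": any(sequence.startswith(ix) for ix in indexes),
    }


# Header copied from the post above; the read and quality strings are made up.
record = [
    "@M02501:63:000000000-BT3BH:1:1101:14311:1491 1:N:0: CGTACTAG+TATCCTCT",
    "GTGCCAGCAGCCGCGG",
    "+",
    "CCCCCGGGGGGGGGGG",
]
parsed = parse_fastq_record(record)
print(parsed["indexes"], parsed["starts_with_index"])
```

For this record the two indexes are found in the header only, and the read does not begin with either of them, which is what you would expect when the barcodes are kept in the header just for reference.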

Not needed. Start at the manifest import link above and let us know if you run into any problems!

Good luck.

Hi Mehrbod,
Thanks for your reply.
I asked the sequencing facility and they said they have not trimmed the barcodes off (indices 1 and 2), but they did trim the primers. How did you find out that the barcodes have been trimmed off? If they have been trimmed off, why are the barcodes still present in the R1.fastq.gz and R2.fastq.gz files, separated by a + sign in the first line (@M02501:63:000000000-BT3BH:1:1101:14311:1491 1:N:0: CGTACTAG+TATCCTCT)?
I also did not receive a separate file for the barcodes; they are all listed as index 1 and 2 in a PDF file.
All I received are the R1 and R2 files and a PDF containing index 1 and 2 (barcodes) and the number of reads.
Could you please have a look at the pictures I sent you in my previous post?

Correct me if I’m wrong, but I don’t think this is physically possible unless they are doing their own (very strange) bioinformatic steps. The primers come after the barcodes, so that means they are somehow re-attaching the barcodes, but choosing not to hand you demultiplexed data?

Hi @Fatemah,
I’ll echo what @ebolyen said, the primers come after the barcodes so I’m not sure how they could not have trimmed off the barcodes if they trimmed the primers. Unless what they actually mean is that they just moved the barcodes onto the 1st line of your FASTQ, which is something we’ve seen before.

I just looked at the example you provided, which clearly had the i7 and i5 barcodes added to the first line and not in the sequences themselves. Like I mentioned, sometimes sequencing facilities will demultiplex and trim the barcodes off but move them into the header/first line for your own records. They serve no useful purpose there once the reads are demultiplexed. As such, you don’t need a separate barcode file, since these files are already demultiplexed. Did you try importing using the manifest approach I linked above?

Dear Mehrbod,
I hope you are fine.
I think they removed the barcodes. I am not sure, but I assume the index1+index2 in the first line of the R1 and R2 fastq.gz files won’t interfere with the rest of the analysis. What do you think? I will leave the fastq.gz files as they are.
I am trying to set up the manifest file as described in the link you sent me. I am not sure whether the format is correct. Could you please have a look?
May I know, after making the manifest file, which command I can use to import all the files into QIIME 2 on VirtualBox? I am learning QIIME 2 on my own, which is why I may have lots of difficulties, but I really appreciate your help.
After importing the data, I won’t need to do anything about the barcodes, since we think they have been removed?
I have 51 samples.


Hi Mehrbod :sunflower:

I just changed the manifest file to this one; I think it is more complete. I get many errors like: Forward and reverse reads must be provided exactly one time each for each sample. The following samples had forward but not reverse read fastq files: HH32_S32_L001_R1_001.fastq.gz, HH41_S41_L001_R1_001.fastq.gz, HH40_S40_L001_R1_001.fastq.gz, HH46_S46_L001_R1_001.fastq.gz,…

However, I have checked the manifest file and all the files separately to make sure that both R1 and R2 exist.
I used:
qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path /home/qiime2/Documents/file/manifest_data.csv --output-path demux.qza --source-format PairedEndFastqManifestPhred64
and also
qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path /home/qiime2/Documents/file/manifest_data.csv --output-path demux.qza --source-format PairedEndFastqManifestPhred33
I am not sure about the difference between 33 and 64, or what the source format is; I just tried both.


Hi @Fatemah,

From what I can tell based on what you have posted, your files are demultiplexed, the dual-indexing barcodes have been removed from your sequences and moved to the header line, and the overhang primers have been removed as well. This is exactly what we would want for the manifest importing approach, so I don’t expect any problems with the files as they are.

Regarding your manifest files, are you attaching files for me to look at? Because I am not able to see anything attached. If you are having problems with attachments you can try sending them to me in a direct message or we can figure out another way to share that.
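
In the meantime, since the error you posted is about samples whose forward and reverse files are not both listed under the same sample ID, here is a rough Python sketch that builds the manifest rows from a list of file paths and reports any sample missing one direction. Several things here are assumptions on my part: the header `sample-id,absolute-filepath,direction` is the CSV manifest layout used by the 2018-era import formats, the file names follow the `HH32_S32_L001_R1_001.fastq.gz` pattern from your error message, the sample ID is everything before `_S<number>`, and the directory in the example is only a guess based on the path in your command. The helper names `build_manifest_rows` and `manifest_text` are made up for this post.

```python
import csv
import io
import re
from pathlib import PurePosixPath

# Assumed naming scheme: <sample>_S<number>_L<lane>_R<1|2>_001.fastq.gz
FILENAME_PATTERN = re.compile(
    r"^(?P<sample>.+?)_S\d+_L\d{3}_R(?P<read>[12])_001\.fastq\.gz$"
)


def build_manifest_rows(filepaths):
    """Return (rows, unpaired): manifest rows plus sample IDs missing a direction."""
    rows = []
    directions_seen = {}
    for path in sorted(filepaths):
        match = FILENAME_PATTERN.match(PurePosixPath(path).name)
        if match is None:
            continue  # skip files that do not follow the assumed naming scheme
        direction = "forward" if match["read"] == "1" else "reverse"
        rows.append((match["sample"], path, direction))
        directions_seen.setdefault(match["sample"], set()).add(direction)
    unpaired = sorted(
        sample
        for sample, seen in directions_seen.items()
        if seen != {"forward", "reverse"}
    )
    return rows, unpaired


def manifest_text(rows):
    """Render the rows as CSV text with the assumed manifest header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample-id", "absolute-filepath", "direction"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical paths; HH41 is deliberately missing its R2 file.
example_paths = [
    "/home/qiime2/Documents/file/HH32_S32_L001_R1_001.fastq.gz",
    "/home/qiime2/Documents/file/HH32_S32_L001_R2_001.fastq.gz",
    "/home/qiime2/Documents/file/HH41_S41_L001_R1_001.fastq.gz",
]
rows, unpaired = build_manifest_rows(example_paths)
print(manifest_text(rows))
print("samples missing a direction:", unpaired)
```

If every file really has its partner, the usual remaining suspect is that the R1 and R2 lines of a sample carry slightly different sample IDs in the manifest, so it is worth comparing those two columns carefully.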

The Phred33 version is probably what you want to use. The Phred64 format is based on older machines so you probably won’t have data in that format. You could always confirm by asking your sequencing facility as well.
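
If you want a quick sanity check before asking the facility, the quality lines themselves usually give the offset away. The Python sketch below is only a heuristic under my own assumptions, not something QIIME 2 does for you: the function name `guess_phred_offset` is made up, and the cutoffs rely on Phred33 quality characters starting at `!` (ASCII 33) while Phred64 characters start no lower than `;` (ASCII 59) and run higher than typical Phred33 data.

```python
def guess_phred_offset(quality_lines):
    """Guess 33 or 64 from FASTQ quality lines; return None when it is ambiguous."""
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        return None
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters below ';' cannot occur in Phred64-encoded data.
        return 33
    if highest > 74:
        # Characters above 'J' are unusual for Illumina Phred33 data.
        return 64
    return None  # every character fits both encodings; look at more reads


# Made-up quality lines for illustration.
print(guess_phred_offset(["CCCCCGGGGGGGGG", "CCC#,8ACCFGGGG"]))
print(guess_phred_offset(["ffffghhhhhhhhh", "ddddeffffggghh"]))
print(guess_phred_offset(["CCCCCGGGGGGGGG"]))
```

Note that a single high-quality line such as the one in the last call can be ambiguous, so feed it the quality lines of a few thousand reads rather than just the first one.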
