How to demultiplex sequences with in-read barcodes and Illumina indexes


I have a dataset of sequences from pollen collected from birds. We amplified each sample three times. The samples come from different elevations and seasons, which represent four expeditions. Each sample was amplified with a different forward primer, whereas the reverse primer was always the same. We did this for two markers, ITS-2 and ITS-1. Then the triplicates were combined into one sample. Subsequently, we produced partial libraries combining samples from both markers, and these partial libraries were tagged with Illumina tags. From the sequencer I received each of the libraries split according to the tag used. Now I want to analyze them, but I am confused about how to build the metadata file (for demultiplexing) so that I can split each library back into the original samples.
Can someone help me out with this?

I understand I will have to run the pipeline twice, once for each of the markers (ITS-1 and ITS-2). However, what I am having problems with is how to map the sequences to the original samples. Basically, the problem is that I have combined internal barcode indexing (barcode in the read sequence) with Illumina indexing (in I5 and I7). This means that the fastq files transferred by the sequencing facility for such libraries are only partially demultiplexed, i.e. they have been demultiplexed according to the Illumina indexing only and must be demultiplexed further.

So now I have c. 200 folders (for both R1 and R2) which correspond to the Illumina tags used; however, each of these folders contains sequences coming from 5 to 6 original samples.


Hi @Guillermo_U,

Welcome back to the :qiime2: forum!

Apologies for the delay in response on this - we are working on this, and will follow up with you soon! Thanks!

Thanks a lot @lizgehret !!! I will wait patiently XD

Based on what I understand from what you have described, it sounds like you will need to demultiplex all of your folders based on your barcodes (probably using q2-cutadapt (docs)), then denoise + join (possibly using q2-dada2 (docs)) to produce feature tables; then you can use feature-table merge (docs) to put your data back together. If it seems like I have missed something about what you have or are trying to accomplish, let me know and let's see if we can come up with another solution.
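As a rough sketch of that workflow, the snippet below only builds the command lines for one library and for the final merge; it does not run anything. The flag names are the ones I recall from the q2-cutadapt, q2-dada2 and feature-table plugins, so please check them against `--help` for your QIIME 2 version. The file names, the `barcode` column name, and the truncation lengths (0, meaning no truncation) are placeholders, not recommendations.

```python
def library_commands(library, barcodes_tsv, column="barcode"):
    """Return the demultiplex and denoise commands for one imported library.

    Assumes `<library>.qza` already holds the imported paired-end reads with
    in-read barcodes. All file names are hypothetical.
    """
    demux = [
        "qiime", "cutadapt", "demux-paired",
        "--i-seqs", f"{library}.qza",
        "--m-forward-barcodes-file", barcodes_tsv,
        "--m-forward-barcodes-column", column,
        "--o-per-sample-sequences", f"{library}-demux.qza",
        "--o-untrimmed-sequences", f"{library}-untrimmed.qza",
    ]
    denoise = [
        "qiime", "dada2", "denoise-paired",
        "--i-demultiplexed-seqs", f"{library}-demux.qza",
        "--p-trunc-len-f", "0",  # placeholder: choose from your quality plots
        "--p-trunc-len-r", "0",  # placeholder
        "--o-table", f"{library}-table.qza",
        "--o-representative-sequences", f"{library}-rep-seqs.qza",
        "--o-denoising-stats", f"{library}-stats.qza",
    ]
    return [demux, denoise]


def merge_command(libraries, merged="merged-table.qza"):
    """Return one feature-table merge command covering every library's table."""
    cmd = ["qiime", "feature-table", "merge"]
    for library in libraries:
        cmd += ["--i-tables", f"{library}-table.qza"]
    return cmd + ["--o-merged-table", merged]


if __name__ == "__main__":
    for cmd in library_commands("lib1", "lib1-barcodes.tsv"):
        print(" ".join(cmd))
    print(" ".join(merge_command(["lib1", "lib2"])))
```

Printing the commands first makes it easy to inspect them for a couple of folders before looping over all of them.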

However, if any of the other @moderators have a better workflow I would love to hear about it!

Hello @Keegan-Evans,

First of all, thank you very much for taking the time to answer, and sorry if I am not understanding you or am missing some information. I am rather new to QIIME 2.

That seems logical; however, I am not sure it fits my needs. My files are already demultiplexed according to the Illumina tags used. Therefore, they have the names of the partial libraries I produced in the lab, each of which is a mixture of samples that were all tagged differently in a first round of PCR. Thus, the Illumina indices used are already in the header of each sequence.

I have been checking the information on importing paired-end sequences with within-read barcodes, and the names "forward.fastq.gz" and "reverse.fastq.gz" are required. So how can I deal with the situation of having 200 different folders whose names encode information necessary for identifying the samples?
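One way the naming requirement could be handled is to stage each folder into its own directory that keeps the folder name but holds copies named "forward.fastq.gz" and "reverse.fastq.gz". The sketch below is only an illustration: it assumes each folder contains exactly one file with `_R1` and one with `_R2` in its name (matching is case-sensitive), and `stage_library` is a hypothetical helper, not part of QIIME 2.

```python
import shutil
from pathlib import Path


def stage_library(src, dest_root):
    """Copy one library folder's R1/R2 files to the names the importer expects.

    The library ID survives as the name of the staged directory.
    Assumes exactly one *_R1*.fastq.gz and one *_R2*.fastq.gz per folder.
    """
    src, dest_root = Path(src), Path(dest_root)
    r1 = sorted(src.glob("*_R1*.fastq.gz"))
    r2 = sorted(src.glob("*_R2*.fastq.gz"))
    if len(r1) != 1 or len(r2) != 1:
        raise ValueError(f"expected one R1 and one R2 file in {src}")
    dest = dest_root / src.name
    dest.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(r1[0], dest / "forward.fastq.gz")
    shutil.copyfile(r2[0], dest / "reverse.fastq.gz")
    return dest


# Example loop over all library folders (paths are hypothetical):
# for folder in sorted(Path("raw").iterdir()):
#     if folder.is_dir():
#         stage_library(folder, "staged")
```

Each staged directory could then be imported on its own, so the library ID stays attached to the resulting artifact through its file name.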

Basically, I need to combine the Illumina tags (already in the header of the sequences) with the within-read barcodes and primers to map the sequences to the original samples.

Now let's imagine: we have 12 different samples which were collected at different times and locations. What we did is the following:

- Samples 1, 2, 3. Amplified in triplicate with fwd primers F1, F2, F3, and rev primer R. Triplicates pooled together into sample A.
- Samples 4, 5, 6. Amplified in triplicate with fwd primers F4, F5, F6, and rev primer R. Triplicates pooled together into sample B.
- Samples 7, 8, 9. Amplified in triplicate with fwd primers F7, F8, F9, and rev primer R. Triplicates pooled together into sample C.
- Samples 10, 11, 12. Amplified in triplicate with fwd primers F10, F11, F12, and rev primer R. Triplicates pooled together into sample D.

Each of the fwd primers used has a unique barcode, whereas the barcode of the rev primer is always the same. These barcodes encode not only the sample ID but also information on where and when the sample was taken.
Then samples A, B, C, D were combined to produce "partial libraries (PLs)" and tagged with Illumina indices:


- PL1: A+B. Tagged with Illumina indices A701-A501
- PL2: C+D. Tagged with Illumina indices A702-A502

Final library: PL1+PL2
So now I have my sequences split according to the Illumina tags. In the example, it would be like having individual fastq.gz files for PL1 and PL2.
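To make the example concrete, here is a minimal sketch of the per-library barcode table this design implies: one row per original sample, keyed by the forward-primer barcode, with the pool and the Illumina index pair carried along. The barcode sequences, the `PL1.S1`-style sample IDs, and the column names are placeholders invented for illustration; the real values would come from the primer design sheet.

```python
import csv
import io

# Hypothetical placeholder barcodes for fwd primers F1..F12 (not real data).
BARCODES = dict(zip(
    (f"F{n}" for n in range(1, 13)),
    ["AACCGA", "ACGTAC", "AGTCTG", "CAGTCA", "CCTAGG", "CGATTC",
     "GATCAG", "GCATGT", "GGTACA", "TACGGT", "TCAGAC", "TGCATC"],
))

# Partial library -> (Illumina index pair, sample numbers it contains).
LIBRARIES = {
    "PL1": ("A701-A501", range(1, 7)),    # pools A + B
    "PL2": ("A702-A502", range(7, 13)),   # pools C + D
}

COLUMNS = ["sample-id", "barcode", "pool", "illumina-indices"]


def barcode_rows(library):
    """Return one metadata row per original sample in a partial library."""
    indices, samples = LIBRARIES[library]
    return [
        {
            "sample-id": f"{library}.S{n}",       # unique across libraries
            "barcode": BARCODES[f"F{n}"],          # sample n used fwd primer Fn
            "pool": "ABCD"[(n - 1) // 3],          # three samples per pool
            "illumina-indices": indices,
        }
        for n in samples
    ]


def to_tsv(rows):
    """Serialize rows as tab-separated text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    print(to_tsv(barcode_rows("PL1")))
```

With one such table per partial library, the Illumina index pair identifies the file and the in-read barcode identifies the sample within it.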

How can I deal with this situation? Is QIIME 2 able to do it?

Thanks a lot, and sorry for a reply as thick as a brick!

Hmm, I may have to spend some time thinking on this or asking around to other @moderators, who have more relevant experience than I do!

Good morning, Guillermo,

Thank you for your detailed description of your multiplexing method. That is quite the technique, but I think I followed along.

It should be possible to take the ID from the folder name and add it to the header of the enclosed fastq files. So

$> ls folder_ITag22

$> head -n1 folder_ITag22/sample_R1_Index1.fastq

Normally you add sample ID based on file name:
@SIM:1:FCX:1:15:6329:1045:GATTACT+GTCTTAAC 1:N:0:ATCCGA Index1

But you want to add both file name and folder name
@SIM:1:FCX:1:15:6329:1045:GATTACT+GTCTTAAC 1:N:0:ATCCGA ITag22_Index1
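A minimal sketch of that header tagging, under a few assumptions: the folder is named like `folder_ITag22`, the file like `sample_R1_Index1.fastq`, and every FASTQ record is exactly four lines. The helper names are made up for illustration.

```python
from pathlib import Path


def labels_from_path(path):
    """Derive (folder ID, file ID) from a path like folder_ITag22/sample_R1_Index1.fastq."""
    path = Path(path)
    folder_id = path.parent.name.split("_", 1)[-1]       # "folder_ITag22" -> "ITag22"
    file_id = path.name.split(".")[0].split("_")[-1]     # "sample_R1_Index1" -> "Index1"
    return folder_id, file_id


def tag_fastq_lines(lines, label):
    """Append `label` to every header line of a four-line-per-record FASTQ.

    Headers are found by position (every 4th line), not by a leading "@",
    because quality strings may also start with "@".
    """
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield f"{line.rstrip()} {label}\n"
        else:
            yield line


# Example usage (paths are hypothetical):
# src = Path("folder_ITag22/sample_R1_Index1.fastq")
# folder_id, file_id = labels_from_path(src)
# with src.open() as fin, open("tagged.fastq", "w") as fout:
#     fout.writelines(tag_fastq_lines(fin, f"{folder_id}_{file_id}"))
```

For gzipped files, the same generator should work on handles opened in text mode with `gzip.open(path, "rt")`.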

Does this sound like what you want?

Did your team develop this method, or is there an original source you can cite? If it's been used before, how did they demultiplex it?
