Demultiplexing paired end FASTQ files using a mapping file

I downloaded multiplexed paired-end FASTQ files from a study conducted on dust samples. I have a text file named dust_mapping.txt which contains information about the barcodes and primers associated with each sequence. I also have 4 FASTQ files (each nearly 1.5 GB) with these filenames:
dust1_R1.fastq (forward),
dust1_R2.fastq (reverse),
dust2_R1.fastq (forward),
dust2_R2.fastq (reverse)
The study says that they used 155 samples. Therefore, these files are NOT demultiplexed files, otherwise I should have got 155 fastq files (each for forward and reverse reads). The problem is I don’t have a fastq.gz file of barcodes to use the following command to demultiplex my data. I rather have a text file which contains a lot of other information apart from barcodes. Do I need to extract the barcodes and convert a text file into a fastq file? If YES, how can I do that? Also, how do I know how many samples are there in dust1_R1 vs dust2_R1?
Please Help!
Link to download data: https://figshare.com/articles/Lillis_Dust_Sequencing_Data/709596

Hi @SAHIL_JAIN_16110144,

Welcome to the forum! You are right in that these are multiplexed files that need to be demultiplexed.

You don’t need separate barcode files for these, since the barcodes are already in the reads.
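
If you want to see this for yourself, here is a minimal Python sketch that tallies the first few bases of each read. It assumes the barcodes sit at the very start of each read and that the FASTQ uses plain 4-line records; neither is confirmed for this dataset, and the function name and example reads are made up for illustration.

```python
from collections import Counter
from itertools import islice


def count_read_prefixes(fastq_lines, prefix_len=6, max_reads=10000):
    """Tally the first `prefix_len` bases of up to `max_reads` reads.

    Assumes plain 4-line FASTQ records (no wrapped sequence lines).
    """
    counts = Counter()
    # The sequence is the 2nd line of every 4-line record.
    seq_lines = islice(fastq_lines, 1, None, 4)
    for seq in islice(seq_lines, max_reads):
        counts[seq.strip()[:prefix_len]] += 1
    return counts


# Made-up reads, only to show the shape of the result.
example = [
    "@read1", "ACGTACTTGGCC", "+", "IIIIIIIIIIII",
    "@read2", "ACGTACAAGGTT", "+", "IIIIIIIIIIII",
    "@read3", "TTGCAAGGCCAA", "+", "IIIIIIIIIIII",
]
prefix_counts = count_read_prefixes(example)
print(prefix_counts.most_common())
```

On the real data you could pass an open file handle instead of the list, e.g. `with open("dust1_R1.fastq") as fh: count_read_prefixes(fh)`, and compare the most common prefixes against the barcodes in dust_mapping.txt.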

Since these are dual-indexed reads, you’ll want to use the demux-paired action of the cutadapt plugin to demultiplex them, passing the provided mapping file as the metadata file that holds the barcode information.
The one thing you do need to do, however, is modify that mapping file so it has separate forward and reverse barcode columns. If you read the description on figshare, they used 6 bp dual barcodes, but in the mapping file the two are combined into one column that holds 12 bp sequences. You’ll just want to reverse that process and split the barcodes into 2 columns to match what cutadapt requires. As for what orientation they are in, I have no idea! You might have to play around a bit to figure it out, i.e. by looking into the actual sequences.
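
A rough Python sketch of that split, under a few assumptions: the mapping file is tab-separated, the combined column is called BarcodeSequence (a guess, check the actual header), and the first 6 bases are the forward barcode. The new column names barcode_fwd and barcode_rev, the sample IDs, and the barcodes in the example are all invented. The reverse_complement helper is only there in case the orientation turns out to be flipped.

```python
import csv
import io

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_barcode_column(tsv_text, barcode_col="BarcodeSequence", fwd_len=6):
    """Append forward/reverse barcode columns split from one combined column.

    `barcode_col` is an assumed column name; adjust it to the real header.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = list(reader.fieldnames) + ["barcode_fwd", "barcode_rev"]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        combined = row[barcode_col].strip()
        row["barcode_fwd"] = combined[:fwd_len]  # first 6 bases
        row["barcode_rev"] = combined[fwd_len:]  # remaining 6 bases
        writer.writerow(row)
    return out.getvalue()


# Made-up mapping rows, only to show the transformation.
example = "#SampleID\tBarcodeSequence\nS1\tAACCGGTTGGCC\nS2\tTTGGCCAACCGG\n"
split_text = split_barcode_column(example)
print(split_text)
```

If demultiplexing assigns almost no reads, trying the other order (last 6 bases as forward) or applying reverse_complement to one of the columns would be the next things to test.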

As for dust1 vs dust2, I’m not sure. I think they had to split these files because of upload size limits? It would be convenient if they represented Run1 vs Run2 samples (see below), but I don’t know. You should probably just combine those files into 1 larger file anyway.
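
Combining uncompressed FASTQ files is plain concatenation. Here is a small Python sketch of it; the helper name is made up, and the demo uses tiny temporary files rather than the real data. The one thing to watch is that the forward and reverse files are joined in the same order (dust1 then dust2 for both R1 and R2) so the read pairs stay in sync.

```python
import shutil
import tempfile
from pathlib import Path


def concatenate_files(inputs, output):
    """Append each input file to `output`, in the order given."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Demo with two tiny made-up FASTQ files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "a_R1.fastq").write_text("@a\nAC\n+\nII\n")
    (tmp / "b_R1.fastq").write_text("@b\nGT\n+\nII\n")
    concatenate_files(
        [tmp / "a_R1.fastq", tmp / "b_R1.fastq"], tmp / "combined_R1.fastq"
    )
    combined = (tmp / "combined_R1.fastq").read_text()

print(combined)
```

For the real files this would be one call with dust1_R1.fastq and dust2_R1.fastq, and a second call with dust1_R2.fastq and dust2_R2.fastq.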

I said it would be ideal if they were separated by run because, if you are planning on using DADA2 for your denoising, you’ll want to separate the samples based on the 2 MiSeq runs they came from. This is mentioned in the descriptions. The error model will be different between the 2 runs, so you’ll want to process them separately and then merge your tables afterwards. If you use Deblur for denoising, this isn’t necessary and you can process all of them at once.
If you decide to use DADA2, the easiest way to separate the samples in QIIME 2 is to figure out which samples are from Run1 and include only those samples in the mapping file when demultiplexing. Then repeat the demultiplexing step with a mapping file that has only Run2 samples.
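
Once you know which samples belong to which run, building the per-run mapping files could look something like this Python sketch. It assumes a tab-separated mapping file whose ID column is named #SampleID; the function name, sample IDs, barcodes, and run assignment below are all hypothetical.

```python
def filter_mapping(tsv_text, keep_ids, id_col="#SampleID"):
    """Keep the header plus only the rows whose sample ID is in `keep_ids`."""
    lines = tsv_text.splitlines()
    header, rows = lines[0], lines[1:]
    id_index = header.split("\t").index(id_col)
    kept = [row for row in rows if row.split("\t")[id_index] in keep_ids]
    return "\n".join([header] + kept) + "\n"


# Made-up mapping file and run assignment, for illustration only.
mapping = (
    "#SampleID\tbarcode_fwd\tbarcode_rev\n"
    "S1\tAACCGG\tTTGGCC\n"
    "S2\tTTGGCC\tAACCGG\n"
    "S3\tGGTTAA\tCCAATT\n"
)
run1_ids = {"S1", "S3"}  # hypothetical Run1 samples
all_ids = {"S1", "S2", "S3"}

run1_mapping = filter_mapping(mapping, run1_ids)
run2_mapping = filter_mapping(mapping, all_ids - run1_ids)
print(run1_mapping)
print(run2_mapping)
```

Each output would then be saved as its own mapping file and used for a separate demultiplexing pass.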

It’s a bit of a tricky situation, especially if you are not familiar with QIIME 2 commands, but totally doable. Keep us posted.
Good luck!
