Splitting fastq files

paaigehansen · March 16, 2020, 1:14am

Hi everyone,

I'm trying to import Casava 1.8 paired-end reads into QIIME 2. However, when these samples were submitted for sequencing, the mapping file contained 2 errors (duplicate sample IDs), so 4 samples ended up combined in 2 fastq files (i.e., there are 2 fastq files when there should be 4, with 2 samples per fastq). Within each of these 2 fastq files, the reads still carry their original, unique barcodes. Here's an excerpt from one of the fastq files:

CCCCCGGGGGGGGGGGGEGGGGGGGGGGGGFFGGGGGGGGGGGGGGGGGGGGGFFFGGGGGGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDFGGGGGGGGGGGGGGGGGGGGGCEGGGGFEGGGFGGEGGGGGGGGGGGGGGGGGDGEGCFGGGGGGGGDEGCCEGGGFEFFCFFGECCCEGGGGGCEC7EF8E:*CE5ECG?CEG?FGCFGGGCF=:CE89C7=CEFGC>=D>:7F9CCG:?CE3/>7CFGCDGG7FF?)/))-)<FA?=)9944-54<F?)(704>04(())

@M00161:110:000000000-CGG2R:1:1101:11873:1171 1:N:0:TACGCTGC+GTAAGGAG

GTGAGTCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATTCCGAGGGGCATGCCCGTTCGAGCGTCATTACAATCCTCAAGCCTGGCTTGGTGTTGGGGCCTGCTGCTACTGGCAGCCCTTAAAACCAGTGGCGGTGCCATCTGGCTCCTAAGCGTAGTAATACCCCTCGCTACAGGGTCCGGTGGATGCCTGCCAGCAACCCCCCATTTTTCTATGGTTGACCTCGGATCGGGTCGGGATACCCGCTGAACTTAAGCATATCAATCAGCGGAGGACTGTCCCTTATACACATCCCC

CCCCCGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGFFGGGGEGGGGGGGGGGGGGGGGDEFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGCDEGGGDGGGGGGGGGGFEDFGGGFFGGGGGGGGGGGGGGGGCE8CGGGGEGGGGGGGGGGGGGGGG6<<=FGGG=EGGG77AFFGCFG7CCFGGGGCF=EGDDCDGG35*/:3CC<9<3)7D*7CGGGGC6@F47**)*09,)<9BE??F><9))74:9A4).<,3(.4

@M00161:110:000000000-CGG2R:1:1101:15918:1176 1:N:0:CGGAGCCT+GTAAGGCG

GTGAGTCATCGAATCTTTGAACGCAAATTGCACTTCCTGGTACTCCGGGAAGTATGCCTGTTTGAGGGTCAGTATAATCACAATCGAGTGTGTTTTTTTTTTTTTTATTTGGTATCACTATCGGACTCGAGTTATATTAATTGTAATTGATTTAAGTGACTCTAAATTAACTACGTCTTTTAGGCGTGATTCGAATTTTATTTTTGCGTCCTTAATATTTTTTTTTCATTAGCTGTGATTTTCGTCATTATATAGGAAAACGTGTCTATAATTTTTTTTGACATTTACCTGAATTCAGGTA

Does anyone know how I can split each of these 2 fastq files by sample so that I can analyze them with the rest of my sequencing library? Maybe by barcode? If I remove these samples, I am able to import the rest of the library.

Thanks!

Nicholas_Bokulich · March 16, 2020, 3:53pm

Welcome to the QIIME 2 forum, @paaigehansen!

I have re-classified this as "other bioinformatics tools" because this is a technical question about something outside of QIIME 2.

This calls for a bit of custom code to separate those files. Fortunately, I think some simple grep will do the job here, since the barcode info in the header lines provides unique information.

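This should do it, but no guarantees: I am cooking this up from memory and have not tested it. Run it once per barcode for each file; just pop this into your terminal:

grep -F -A 3 'N:0:TACGCTGC+GTAAGGAG' put-path-to-original-file-here.fastq | grep -v '^--$' > put-path-to-new-file-for-barcode-TACGCTGC+GTAAGGAG-here.fastq

The first grep grabs only the header lines that contain whatever barcode you put in quotes, plus the 3 lines that follow each one (sequence, "+", and quality scores). The -F flag treats the barcode as a literal string rather than a regular expression. When -A is used, grep prints a "--" separator line between non-adjacent groups of matches, so the second grep removes those lines to keep the output valid FASTQ. If your files are gzip-compressed (.fastq.gz), decompress them first.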
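If you would rather split a file in a single pass instead of running grep once per barcode, something along these lines might work. This is only an untested-on-real-data sketch: it assumes uncompressed files with strict 4-line records (no blank lines, no wrapped sequences) and Casava 1.8 headers where the barcode is the text after the last colon. The function name split_by_barcode and the file names in the example comment are made up for illustration. Note that every distinct barcode string in the headers gets its own output file, so any unexpected barcodes would show up as extra files.

```shell
#!/bin/sh
# Hypothetical helper: split one FASTQ file into one file per barcode string
# found in its header lines. Not a QIIME 2 command.
split_by_barcode() {
    input=$1    # path to the combined, uncompressed FASTQ file
    prefix=$2   # output prefix; files are named <prefix><barcode>.fastq
    awk -v prefix="$prefix" '
        NR % 4 == 1 {
            # Header line (1st of every 4): barcode is the last ":"-separated field.
            n = split($0, parts, ":")
            out = prefix parts[n] ".fastq"
        }
        # Write every line of the record to the file chosen by its header.
        { print > out }
    ' "$input"
}

# Example (hypothetical file names), run once per combined file:
#   split_by_barcode combined_R1.fastq split_R1_
#   split_by_barcode combined_R2.fastq split_R2_
```

Because the records are written in the order they are read, running this on the forward and reverse files separately should keep read pairs in matching order, provided both files list the reads in the same order to begin with.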

Good luck!

paaigehansen · March 16, 2020, 7:20pm

I think this did the trick! Thanks for the suggestion!