How to import a single FASTQ file containing all sample sequences, and how to make a mapping file for it

I plan to analyze Illumina paired-end sequencing data with QIIME 2. However, these data were sequenced three years ago: the forward and reverse reads have already been merged, the barcodes have been removed, and the demultiplexed samples have been combined into a single FASTQ file. How can I import this FASTQ file into QIIME 2? Thank you very much.

The sequences in the FASTQ file look like the ones below.
Here 1_1_1 is a sample name, and _1, _2, … are the sequences in sample 1_1_1. I have 18 samples in total in this FASTQ file.

@1_1_1_1 M03073:62:000000000-AEP8L:1:2113:20684:11199 1:N:0:CAGGCG orig_bc=CGTACG new_bc=CGTACG bc_diffs=0
GTTAGGAATCTTGGGAATGGGCGAAAGCCTGACCCAGCAACGCCGCGTGGGCGATGAAGGCCTTCGGGTCGTAAAGCCCTTTTGTGGGGGAAGATGATGACGGTACCCCACGAATAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTACTGGGCGTAAAGGGCGTGCAGGCGGTTCGTTAAGTTCGAGGTGAAAGCTCCCGGCTCAACTGGGAGAGGTCCCCGGATACTGGCGGACTTGAGGGAGGCAGAGGAAAGTGGAATTCCCGGTGTAGTGGTGATATGCGTAGATATCGGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGGCCAGTACTGACGCTGAGGAGCGAAAGCGTGGGTAGCAAACAGGATTAGAAACCCTTGTAGTCC
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGG>CFGGGGGGGGGGGDFDEGGGGGGGGGGGGGEGGGGGGGGGGGG#GGGGGG#F#GGEE=EFCGFGGGGCECE7EGGGGGGGGFGGGEGFGGEFGA@FEAFAAFAFFD:7FGCFEGFEGGDGGGFGGGGGGGGGGGGGGGGGGGEGEECGGGGGGGGGFFFGGGGCFGGGGEEGGGGGGGGGFFEFGGGGGDGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGG
@1_1_1_2 M03073:62:000000000-AEP8L:1:2106:24993:14973 1:N:0:CAGGCG orig_bc=CGTACG new_bc=CGTACG bc_diffs=0
GTGAGGAATATTGGGCAATGGGCGCAAGCCTGACCCAGCCACGCCGCGTGCAGGATGACACCCCTATGGGGCGTAAACTGCTTTTATACGAGAAGAACCTCCCGAACTTGTTCGGGACTGACGGTACCGTATGAATAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCAAGCGTTGTCCGGATTTATTGGGTGTAAAGGGTGTGTAGGCGGGTTTGCAAGTCAGAGGTGAAATCCTGCGGCTCAACCGTAGAATTGCCTCTGATACTGCAGGTCTTGAGTCCTGGAGAGGTTGTCGGAATTCGTGGTGTAGCGGTGAAATGCGTAGATATCACGAGGAACACCGGTTACGAAGGTGGGCAACTGGACAGGTACTGACGCTGAGGCACGAAAGCGTGGGGAGCAAACAGGATTAGAAACCCTAGTAGTCC
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFF9CGGGGGGFGGGGFGEGFFGGGGGGGCGEGEFGGG@CFDGFGFFFGGGGGGGGGGEFGGFGGGGGGFFF<BAEBAE8<@AFGFCAC5BFF?5DFFGFGFGGGGEDGGFCGC>GGGGGGECGGGFFFGGDGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGG
@1_1_1_3 M03073:62:000000000-AEP8L:1:1112:21078:24139 1:N:0:CAGGCG orig_bc=CGTACG new_bc=CGTACG bc_diffs=0
GTAGGGAATCTTCCGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGAACGATGAAGGTCTTCGGATCGTAAAGTTCTGTTGTCAGGGAAGAACAAGTGCCGTGCGAATAGAGCGGCACCTTGACGGTACCTGACGAGGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGGAAGCGTTGTCCGGAATTATTTGGGGTAAAGCGCGCGCAGGCGGGTTCTTAAGTCTGATGTGAAAGCCCACGGCGCAACCGTGGAGGGTCATTGGAAACTGGGGAACTTGAGTACAGAAGAGGAGAGTGGAATTCCACGTGTAGCGATGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGCCCAATACTGACACTCATGTGCGAAAGCGTGGGGAGCAAACAGGATTAGAAACCCCAGTAGTCC
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGFGFGGGGGGGFGGGGGGFGGGGGFGFGGGGGEGGGGGGGGGGGGGGGFGGGGGFFDFFCGEGGD7FFDGGGGGGDGGGG?FGGGGG@FFFFFDC7<E8E5C9EC<C88EEE
:CCA8
;C89<EC8CF6FGF+<E8/1<<:<>EGGE58C58851+<+<##9@?++?E@;7((/((5;9>))BGB910)3)98>@5A@ACFCC7ADFFGFFFCFC8CFFFD@DGDGDFA82@9GFFGGGFEEE>=8B=8=+@9GGGGGGGGFFGGGE=GGGGFE?GGGGGFGGGGGGGGGGGGGEGFGGGGGGGGGGDGGGGGGGGGGGGGGGGGGGGGF7GGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGDG
@1_1_1_4 M03073:62:000000000-AEP8L:1:2116:7623:19303 1:N:0:CAGGCG orig_bc=CGTACG new_bc=CGTACG bc_diffs=0
GTGGGGAATTTTGGACAATGGGGGCAACCCTGATCCAGCCATTCCGCGTGAGTGAAGAAGGCCTTCGGGTTGTAAAGCTCTTTCAGTTGGAACGAAACGGTACGCTCTAACATAGCGTGCTAATGACGGTACCGACAGAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGATTGTTAAGCAAGATGTGAAATCCCCGGGCTTAACCTGGGAATGGCATTTTGAACTGGCAGTCTAGAGTGCGTCAGAGGGGGGTGGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAACACCGATGGCGAAGGCAGCCCCCTGGGATGACACTGACGCTCATGTACGAAAGCGTGGGTAGCAAACAGGATTAGAAACCCCAGTAGTC
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGFGDGGGGGGGGGGGGGGGGEGGGDGGGGGDGFGGGGEECGGGGGGGGGGC,FGGGGGGGGGGGGGGGGGFGGC>CCCEGGGGGGFFGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGF6CFCFGF7FGGGGDGG#FGGGEFFGF==:BE:66FEE65E:BFEE7B@FFFFFDA7FFFEGAGDGGGGGFDGCGGGGGGGGGFFFEGFFGGGGCGGGGGGFGGGGGGGGGGGGGEGGGGGGEDGGGGGGF@DGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@1_1_1_5 M03073:62:000000000-AEP8L:1:1109:8535:11431 1:N:0:CAGGCG orig_bc=CGTACG new_bc=CGTACG bc_diffs=0
GTGGGGAATATTGCGCAATGGGCGAAAGCCTGACGCAGCGACGCCGCGTGGGTGATGAAGGCCCTCGGGTCGTAAAGCCCTGTCGGGAGGGAAGAAACATTGCCGATCGAATAGATCGGTGACTTGACGGTACCTCCTAAGGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTGTTCGGAATCACTGGGCGTAAAGCGCGTGTAGGCGGCCTCCTAAGTCCGATGTGAAAGCCCGGGGCTCACCCCCGGAAGTGCATCGGATACTGGGAGACTAGAGTACCGGAGAGGAGGGTGGAATTCCTGGTGTAGCGGTGAAATACGTAGATATCAGGAGGAACACCGGTGGCGAAGGCGGCCCTCTGGACGGATACTGACGCTGAGACGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCGAGTAGTCC
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGCFGGGGGGGGEGEG5
:E:EGGGGGGGGGFGGGGGGGGEEGGGGGDE#GGGGFGGGE#GGGEFDF#FGGGGGGGGC9CFGGGGGGGGG9CBFGGFD4=?
@52==CB+>F>GFFFFCFFFGFFB>?5GGGGGFGGGGGGGGGFGGGDGGGGGGGGGGGGGGFGCGGGGGGGGGGGGGGGGFCGGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGDGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGG
@1_1_1_6 M03073:62:000000000-AEP8L:1:2107:4046:9113 1:N:0:CAGGCG orig_bc=CGTACG new_bc=CGTACG bc_diffs=0
GTGAGGGATATTGCGCAATGGGGGAAACCCTGACGCAGCGACGCCGCGTGGGTGAAGACGGTCTTCGGATTGTAAAGCCCTTTTCTGTGTGACGAGTAAGGACGGTAGCACAGGAATAAGTCTCGGCTAACTACGTGCCAGCAGCCGCGGTAAAACGTAGGAGGCAAGCGTTATCCGGAGTTACTGGGCGTAAAGCGCGCGCAGGCGGCGTGTTAAGTGTGGGGTCAAAGGTCCAGGCTCAACCTGGGAAAGGCAACACAGACTGACGCGCTGGAGGCAGCTAGAGGGACGCGGAATTCGGGGTGTAGCGGTGAAATGTGTAGAGATCCCGAGGAACACCAGCGGTGAAGACGGCGTCCTGGGGCTGACCTGACGCTGAGGCGCGAAGGCGTGGGGAGCGAACGGGATTAGATACCCGTGTAGTCC
+
GGGGGGGGGGGGGGGGGGGGFGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGDGGGGGGGGGGGEFCFGGGGGGFGDGCGGGGGGGGGGGGGGG>:FGGGCECGGGGGGGGGGGGGGGGGGFGGGG,FGFGGGGFF@FEF<:<9:FFGFFG9FGGGFGGGGGGFDFGGGG;@FGEGGGD5*:EFGGGGG>ECGGGGEGGGGGGCC7ECC:?EGGGGFGF@69:<C1CCGCFFGCFGGGFFFFFFFE##C@;;;FF@>EEFAFFFF7)EGFFAFFFFGGGGGGGGFB8@GGGGGGGEGGGGF>ACFGGGGGDGGGGC3F@GGDGGGFGGFGGGGGGFGGGGGGGGGGGFGDGGGGGGGGECGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGG
@1_1_1_7 M03073:62:000000000-AEP8L:1:1119:8673:15050 1:N:0:CAGGCG orig_bc=CGTACG new_bc=CGTACG bc_diffs=0
GTGGGGAATTTTGCGCAATGGGGGCAACCCTGACGCAGCAACGCCGCGTGGAGGATGAAGGTTTTCGAATCGTAAACTCCTGTCCGGAGGGAAGAAAGCAATGACGGTACCTCTGGAGGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGGGGGGGCGAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGTCTCAGAAGTTTTGGGTGAAAGCCCTCGGCTCAACCGAGGAAGTGCCTGGAAAACCATGGGACTGGAGTGCTGGAGAGGCAAGCGGAATTTCTGGTGTAGCGGTGAAATGCGTAGATATCAGAAGGAACATCTGAGGCGAAGGCGGCTTGCTGGACAGACACTGACGCTGAGGCGCGAAAGCCAGGGGAGCAAACGGGATTAGAAACCCGAGTAGTCC
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGG#GGG#DGEGFEGGG#GGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGFGGFGFGGGGGGGGG#GFGGGGGFGFGFGGGGGGGGGGGGGFGGGGGGGGGGGFFFA;56FFFGGGGFFGGF6DGGGGGFFCFBFEDFGGGGDEGGGGF>GGGGGGGGGGGGGGGGGGGGGGFAF@GGGGGGGGGDGGGGGGGGGFCGFGGGGFGGGGGGGFGGGGGGGGFEFGFCGGGGGGFGGFAGGGGFGGGGEGEGGGFAFEGGGEG

Hello @laibinhuang,

I think I have a pretty good way to do this:

This command will give you just the reads from the 1_1_1 sample:
grep -A3 "^@1_1_1" all.fastq > just_1_1_1.fastq

So you could run that 18 times, or write a loop to process all your samples.

Then you can import all 18 fastq files using the fastq manifest format!

Let me know how well this works for you!
Colin
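A minimal sketch of the manifest step, under several assumptions that are not from this thread: the per-sample files are named just_<sampleID>.fastq as in the grep command above, they sit in a made-up directory called manifest_demo, and the installed QIIME 2 release accepts the tab-separated V2 manifest format. The import command is left commented out because it needs an activated QIIME 2 environment; for reads that are already merged, the JoinedSequencesWithQuality type is probably the one to use, but please check the importing tutorial for your release.

```shell
# Hypothetical layout: per-sample files named just_<sampleID>.fastq in manifest_demo/.
mkdir -p manifest_demo
# Two tiny placeholder files so the sketch runs on its own (not real data).
printf '@1_1_1_1 demo\nACGT\n+\nGGGG\n' > manifest_demo/just_1_1_1.fastq
printf '@1_1_2_1 demo\nACGT\n+\nGGGG\n' > manifest_demo/just_1_1_2.fastq

# Header line of a single-end V2 manifest (tab-separated).
printf 'sample-id\tabsolute-filepath\n' > manifest_demo/manifest.tsv
for f in manifest_demo/just_*.fastq
do
  sampleID=$(basename "$f" .fastq)   # e.g. just_1_1_1
  sampleID=${sampleID#just_}         # e.g. 1_1_1
  # The manifest needs absolute paths, so prefix the current directory.
  printf '%s\t%s\n' "$sampleID" "$PWD/$f" >> manifest_demo/manifest.tsv
done
cat manifest_demo/manifest.tsv

# Import step (assumed command, run inside a QIIME 2 environment):
# qiime tools import \
#   --type 'SampleData[JoinedSequencesWithQuality]' \
#   --input-path manifest_demo/manifest.tsv \
#   --input-format SingleEndFastqManifestPhred33V2 \
#   --output-path joined-seqs.qza
```

Older releases use a comma-separated manifest with a direction column instead, so the header line may need to change.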

Thank you very much. In which terminal should I run this: the QIIME 2 terminal or the Windows cmd command line?

Much appreciated.

grep is a Linux program, so use either a Linux terminal, the macOS Terminal, or a Windows Subsystem for Linux terminal.

Dear Colinbrislawn,

You mentioned a loop: "grep -A3 "^@1_1_1" all.fastq > just_1_1_1.fastq. So you could run that 18 times, or write a loop to process all your samples."
How can I write a loop that splits all.fastq into one file per sample?

For example, I have these sample IDs (100 samples):
1_1_1
1_1_2
1_2_1
1_2_2
1_3_1
1_3_2
...
Thank you very much

Linux loops are very powerful and flexible. Take a look at these examples.

For your files, I would start by making a file called sampleIDs.txt that has a list of all your sample IDs.

Then you could read through that file with a loop:

while read line
do
  echo $line
done < sampleIDs.txt

Then modify that loop so it runs grep on all of your files. Maybe like this:

while read line
do
  grep -A3 "^@$line" all.fastq > just_$line.fastq
done < sampleIDs.txt

Let me know what loop you write and how well it works for you! I’m happy to help answer any questions you have along the way.

Colin
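Two details may matter once there are many samples, so here is a hedged variant of the loop above. First, a pattern like "^@1_1_1" also matches reads from an ID such as 1_1_10. Second, grep -A prints a "--" line between groups of matches that are not adjacent, which would corrupt the FASTQ output if one sample's reads are scattered through the file. The demo reads and the split_demo directory are made up, and the pattern assumes headers of the form @<sampleID>_<number> followed by a space, as in the example at the top of this thread.

```shell
# Demo input (made-up reads): sample 1_1_1 is interrupted by a read from 1_1_10.
mkdir -p split_demo
printf '%s\n' \
  '@1_1_1_1 read' 'ACGT' '+' 'GGGG' \
  '@1_1_10_1 read' 'TTTT' '+' 'GGGG' \
  '@1_1_1_2 read' 'CCCC' '+' 'GGGG' \
  > split_demo/all.fastq
printf '%s\n' '1_1_1' '1_1_10' > split_demo/sampleIDs.txt

while IFS= read -r sampleID
do
  # "_[0-9]+ " after the ID keeps 1_1_1 from also matching 1_1_10.
  # The second grep drops the "--" separators that -A inserts between
  # non-adjacent groups (GNU grep also offers --no-group-separator).
  grep -E -A3 "^@${sampleID}_[0-9]+ " split_demo/all.fastq \
    | grep -v '^--$' > "split_demo/just_${sampleID}.fastq"
done < split_demo/sampleIDs.txt

wc -l split_demo/just_*.fastq
```

Each output file should have a line count that is a multiple of four; anything else suggests that a separator or a stray match slipped in.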

Thank you very much, I will try it out. The first command you provided (grep -A3 "^@1_1_1" all.fastq > just_1_1_1.fastq) works perfectly.

Dear Colinbrislawn,
When I run the command below in the QIIME 2 terminal, it works:
qiime2@qiime2core2019-4:~$ grep -A3 "^@1_1_1" /media/sf_Qiime/YRE/data/YRE16S_338F_806R/YRE16S_338F_806R.fastq > /media/sf_Qiime/YRE/data/YRE16S_338F_806R/1_1_1.fastq

However, this doesn't work:
qiime2@qiime2core2019-4:~$ grep -A3 "^@$line" /media/sf_Qiime/YRE/data/YRE16S_338F_806R/YRE16S_338F_806R.fastq > /media/sf_Qiime/YRE/data/YRE16S_338F_806R/$line.fastq
done < /media/sf_Qiime/YRE/data/YREID.txt

It says: grep:done: no such file or directory

Do you have another solution for this?

Thank you very much

I'm glad the grep command is working for you!

Now let's get the loop working...

Compare that to my code:

while read line
do
  grep -A3 "^@$line" all.fastq > just_$line.fastq
done < sampleIDs.txt

What's missing?


Hint:

This would also work:

while read sampleID
do
  grep -A3 "^@$sampleID" all.fastq > just_$sampleID.fastq
done < sampleIDs.txt

but this would not:

  grep -A3 "^@$sampleID" all.fastq > just_$sampleID.fastq
done < sampleIDs.txt

Thank you very much, I will try it out tomorrow. I really appreciate it.

Dear Colinbrislawn,
while read sampleID
do
grep -A3 "^@$sampleID" all.fastq > just_$sampleID.fastq
done < sampleIDs.txt

I got this error:
.fastq: protocol error 1_1_1
.fastq: protocol error 2_2_1

Thank you very much
Laibin

Hmm… I’m not sure what’s going on here.

Let’s try some things and see what parts are working.

  1. Just output the name of each sample:
while read sampleID
do
echo $sampleID
done < sampleIDs.txt
  2. Just output the first line of each sample:
while read sampleID
do
# double quotes, so that the shell expands $sampleID inside the pattern
grep -A3 "^@$sampleID" all.fastq | head -n 1
done < sampleIDs.txt

How do these work for you?

Colin

Dear Colin,
while read sampleID
do
echo $sampleID
done < sampleIDs.txt
The above code outputs only the first two sample names:
1_1_1
2_2_1

while read sampleID
do
grep -A3 '^@$sampleID' all.fastq | head -n 1
done < sampleIDs.txt

This one does nothing.

I uploaded these two files for you to test.
Much appreciated,
laibin
SampleID.txt (19 Bytes)
Samplesequence1.txt (45.4 KB)
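
One possible explanation for both symptoms, offered as a guess that was not confirmed in this thread: the ID file may have been saved on Windows, with CRLF line endings and no newline after the last ID. A trailing carriage return would become part of $sampleID, so it ends up inside the output file name and at the end of the grep pattern, and read returns a non-zero status for a last line that lacks a newline, so the loop skips that ID. A minimal sketch under that assumption, with made-up IDs and a made-up crlf_demo directory:

```shell
mkdir -p crlf_demo
# Demo ID file with Windows (CRLF) line endings and no newline after the last ID.
printf '1_1_1\r\n2_2_1\r\n3_3_1' > crlf_demo/sampleIDs.txt

# Inspect the raw bytes: a "\r" before each "\n" means CRLF line endings.
od -c crlf_demo/sampleIDs.txt

# tr removes the carriage returns; "|| [ -n "$sampleID" ]" keeps a final
# line that has no trailing newline.
tr -d '\r' < crlf_demo/sampleIDs.txt |
while IFS= read -r sampleID || [ -n "$sampleID" ]
do
  [ -n "$sampleID" ] || continue   # skip blank lines
  echo "$sampleID"
done > crlf_demo/clean_sampleIDs.txt

cat crlf_demo/clean_sampleIDs.txt
```

If od shows no "\r" in the real file, this is not the cause and the cleaned file will be identical to the original apart from the final newline.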

Hello Laibin,

Thanks for trying all these commands!

I noticed something… this line

grep -A3 ‘^@$sampleID’ all.fastq | head -n 1
         ^    these  ^ are fancy quotes
They should be 'normal' quotes (not ‘fancy’ quotes)

Can you check if the command you are running includes these fancy quotes?

Colin
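
A short sketch of two quoting points that are easy to mix up here. Independently of curly versus straight quotes, the shell does not expand $sampleID inside single quotes, so a single-quoted pattern searches for the literal text and matches nothing. The file name cmd_demo.sh and the quote_demo directory are made up for the illustration.

```shell
sampleID=1_1_1
echo "^@$sampleID"    # double quotes: the variable is expanded to 1_1_1
echo '^@$sampleID'    # single quotes: the text is passed through unchanged

# Check a saved command for curly quotes (cmd_demo.sh is a made-up file).
mkdir -p quote_demo
printf '%s\n' 'grep -A3 “^@1_1_1” all.fastq' > quote_demo/cmd_demo.sh
# Prints every line that contains a curly quote, with its line number.
grep -n -e '“' -e '”' -e '‘' -e '’' quote_demo/cmd_demo.sh
```

If the last grep prints nothing for your own script, the quotes are plain ASCII and the problem lies elsewhere.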

Dear Colin,

There is no problem with the quotes.

I found that QIIME 1 can do this with split_sequence_file_on_sample_ids.py (http://qiime.org/scripts/split_sequence_file_on_sample_ids.html).

Thank you very much
laibin
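
For anyone without a QIIME 1 installation, a single-pass awk command might serve as a rough alternative. This is not the QIIME 1 script, only a sketch under two assumptions: every record is exactly four lines, and every header looks like @<sampleID>_<number> followed by a space, as in the example at the top of this thread. The demo reads and the awk_demo directory are made up.

```shell
mkdir -p awk_demo
# Demo input: note the quality line "@GGG", which starts with "@" but is not a header.
printf '%s\n' \
  '@1_1_1_1 read' 'ACGT' '+' 'GGGG' \
  '@1_1_2_1 read' 'TTTT' '+' '@GGG' \
  '@1_1_1_2 read' 'CCCC' '+' 'GGGG' \
  > awk_demo/all.fastq

# Lines 1, 5, 9, ... are headers. Drop the leading "@" and the trailing
# "_<read number>" to get the sample ID, then write the whole 4-line
# record to that sample's file.
awk -v outdir=awk_demo '
  NR % 4 == 1 {
    id = substr($1, 2)
    sub(/_[0-9]+$/, "", id)
    out = outdir "/just_" id ".fastq"
  }
  { print > out }
' awk_demo/all.fastq

wc -l awk_demo/just_*.fastq
```

Because headers are found by position rather than by pattern, a quality line that happens to begin with "@" is not mistaken for a new read. Some awk implementations limit the number of files open at once, which could matter with many samples.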

I found that the hardest part of getting my FASTQ files into QIIME was the way the files are named. Look at the QIIME tutorial examples and name your files the same way with respect to the separators (., -, _) and the L001 field.