A general tool to import FASTQ files

I have downloaded single-end FASTQ files from a sequence archive.

e.g. HMb.MMb.d0.1.fastq.gz

@HMb.MMb.d0.1_19 MISEQ:267:000000000-ABTKW:1:1101:15444:1926 1:N:0: orig_bc=ACACCTGGTGAT new_bc=ACACCTGGTGAT bc_diffs=0
TACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTTAGTGCAGAAGAGGAGAGGGGAACTCCATGTGTAGCGGGGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGTTCGAAAGCGTGGGTAGCAAAC
+
BBBBBFFA5FAB4AFGGEG2FGHGFGFGHHAGFHHAEGGGHHGG?A0BDGHAEEF>[email protected]@BGH//>>[email protected]>FGGC2<0<FEAD/CCADGHFHGD1<DGDD<@@----9/;0E..9.;0CFFGGFB/;BFBBA..//B99-9-./9--;A/;/BB/;B:FFFFFEBB.?DFDFEBFBD.99ABA:AB;.;EBFBF#
@HMb.MMb.d0.1_26 MISEQ:267:000000000-ABTKW:1:1101:14461:1949 1:N:0: orig_bc=ACACCTGGTGAT new_bc=ACACCTGGTGAT bc_diffs=0
TACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGGGAACGCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGGAGTAGTGCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGTTCGAAAGCGTGGGTAGCAAAC
+
[email protected]>EFG4FFGGG4GHFHBGF4FFGHHHHGD>?/?BFGEE[email protected]<.AEFD0<GF?D:.;:CFGFHHFHFGHGG?990;BE??:[email protected]@ABB/BFFFFFFFFF/99/9A?FDA?BFFCEADFFD9ACF.BEF/;BA

After reading the QIIME 2 tutorial on importing (https://docs.qiime2.org/2017.11/tutorials/moving-pictures/#sequence-quality-control-and-feature-table-construction), I don’t think these files fit any of the existing manifest formats.

How can I import a general FASTQ file?

The easiest way (IMO) is to make your own manifest file.

I recommend looking at the data and working out whether it’s Phred64 or Phred33.
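One rough way to check this is to look at the lowest quality character in the file. The sketch below is only a heuristic and assumes plain 4-line FASTQ records; `guess_phred_offset` and its default of 10000 sampled reads are names and values made up for illustration. Characters such as `#`, `-` and `/` (all present in the example read above) sit below ASCII 59, which would point to Phred33.

```shell
# Guess the Phred offset of a gzipped FASTQ file from its lowest quality character.
# Heuristic only; assumes 4-line FASTQ records (no wrapped sequence lines).
guess_phred_offset() {
    local fastq_gz max_reads min_ascii
    fastq_gz="$1"
    max_reads="${2:-10000}"   # illustrative default: number of reads to sample

    # Keep every 4th line (the quality strings), turn the characters into
    # decimal byte values with od, and let awk find the smallest one.
    min_ascii=$(
        gzip -dc "$fastq_gz" 2>/dev/null \
            | head -n "$((max_reads * 4))" \
            | awk 'NR % 4 == 0' \
            | tr -d '\r\n' \
            | od -An -v -tu1 \
            | awk '{ for (i = 1; i <= NF; i++) { v = $i + 0; if (!seen || v < min) { min = v; seen = 1 } } }
                   END { if (seen) print min }'
    )

    if [ -z "$min_ascii" ]; then
        echo "unknown"          # empty or unreadable file
    elif [ "$min_ascii" -lt 59 ]; then
        echo "33"               # characters this low cannot occur in Phred+64 data
    elif [ "$min_ascii" -ge 64 ]; then
        echo "64"               # likely Phred+64, or Phred+33 with only high scores
    else
        echo "unknown"          # 59-63: possibly the old Solexa+64 encoding
    fi
}
```

For example, `guess_phred_offset data/HMb.MMb.d0.1.fastq.gz` prints 33, 64 or unknown. Treat a result of 64 with some caution, since a Phred33 file containing only high scores would look the same.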

Then download an appropriate manifest from here, rename the sample IDs to something that makes sense to you, and change the file paths to point to where your files are located.

The manifest file must contain all of the ‘samples’ you wish to run a method on at a time.

As you only have single-end (SE) data, you will need to set every entry in the direction column to forward.

Hint: $PWD/ expands to whatever directory the command line is currently pointed at.
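If you have a lot of files, the manifest can also be generated rather than edited by hand. This is a minimal sketch, not an official tool: `make_se_manifest` is a hypothetical helper, it assumes the sample ID is simply the file name without the `.fastq.gz` suffix, and it writes full absolute paths instead of `$PWD`.

```shell
# Write a single-end manifest (CSV) for every *.fastq.gz file in a directory.
# Assumption: the sample ID is the file name without the .fastq.gz suffix.
make_se_manifest() {
    local data_dir abs_dir f
    data_dir="$1"
    # Resolve the directory to an absolute path so the manifest works from anywhere.
    abs_dir=$(cd "$data_dir" && pwd) || return 1

    echo "sample-id,absolute-filepath,direction"
    for f in "$abs_dir"/*.fastq.gz; do
        [ -e "$f" ] || continue      # the glob matched nothing
        echo "$(basename "$f" .fastq.gz),$f,forward"
    done
}
```

Something like `make_se_manifest data > se-33-manifest` would then write the header plus one forward row per file.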

Hope this helps :+1:

I have tried that, but it gave the following error:

> qiime tools import   --type 'SampleData[SequencesWithQuality]'   --input-path se-33-manifest --output-path single-end-demux.qza   --source-format SingleEndFastqManifestPhred33
There was a problem importing se-33-manifest:

  Missing one or more files for SingleLanePerSampleSingleEndFastqDirFmt: '.+_.+_L[0-9][0-9][0-9]_R[12]_001\\.fastq\\.gz'


> head se-33-manifest
sample-id,absolute-filepath,direction
HMb.MMb.d0.1,$PWD/data/HMb.MMb.d0.1.fastq.gz,forward
HMb.MMb.d0.2,$PWD/data/HMb.MMb.d0.2.fastq.gz,forward

@zhan_xw

I came across a similar issue because my .fastq file contained all of my sample IDs, even though it was demultiplexed (not quality filtered).

Some conditions: your .fastq file needs to be demultiplexed and have barcodes and primers removed from the sequences. My files contained no primer sequences, and my barcodes were removed using QIIME 1’s demultiplex_fasta.py. My sequences were also pre-joined paired-end reads, but this should also work for single-end data. EDIT: with version 2017.12 and the addition of the cutadapt trim- plugin, primers can be removed from your sequences.

If those conditions aren’t met, you should be able to use QIIME 1’s demultiplex_fasta.py or split_libraries_fastq.py. As you probably know, split_libraries_fastq.py will also quality filter your data in addition to demultiplexing it. I’m not exactly sure what your downstream goal is. I did not want to quality filter in QIIME 1.

Anyway, everything looked like this:

# demultiplex, but not quality filter
# supplying the qual file truncates it to match the demultiplexed seqs file
$ demultiplex_fasta.py -b 8 -m fna_map_qual/map.txt -f fna_map_qual/seqs.fna -q fna_map_qual/qual.qual -o dmptlx

# count sequences prior to conversion (629391 : Total)
$ count_seqs.py -i dmptlx/demultiplexed_seqs.fna

# convert .fasta and .qual to .fastq
$ convert_fastaqual_fastq.py -F -b -c fastaqual_to_fastq -f dmptlx/demultiplexed_seqs.fna -q dmptlx/demultiplexed_seqs.qual -o fastq/

# count sequences after conversion (629391 : Total)
# make sure they match the output from above
$ count_seqs.py -i fastq/demultiplexed_seqs.fastq

# filter each sample ID separately
# might be tedious depending upon how many samples you have
# I have shortened mine to just two sample IDs, so there isn't a huge chunk of code here
# for: --sample_id_fp sampID/AH1.txt : I just shortened my mapping file to contain only that SampleID
filter_fasta.py -f fastq/demultiplexed_seqs.fastq -o fastq/AH1.MS28F.388R.fastq --sample_id_fp sampID/AH1.txt 
filter_fasta.py -f fastq/demultiplexed_seqs.fastq -o fastq/AH2.MS28F.388R.fastq --sample_id_fp sampID/AH2.txt

# gzip each filtered fastq file
gzip fastq/AH1.MS28F.388R.fastq
gzip fastq/AH2.MS28F.388R.fastq
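
With many samples, the per-sample commands above could be generated in a loop instead of typed out. This is a rough sketch under some assumptions: there is one ID file per sample named `<sample>.txt`, the paths contain no spaces, and `print_filter_commands` is a made-up helper that only prints the commands so they can be reviewed before running.

```shell
# Print (not run) one filter_fasta.py + gzip command pair per sample-ID file.
# Assumptions: each <id_dir>/<sample>.txt lists the IDs for one sample,
# and paths contain no spaces (the commands are printed unquoted).
print_filter_commands() {
    local id_dir fastq out_dir suffix ids sample out
    id_dir="$1"      # e.g. sampID
    fastq="$2"       # e.g. fastq/demultiplexed_seqs.fastq
    out_dir="$3"     # e.g. fastq
    suffix="${4:-}"  # optional name suffix, e.g. .MS28F.388R

    for ids in "$id_dir"/*.txt; do
        [ -e "$ids" ] || continue      # the glob matched nothing
        sample=$(basename "$ids" .txt)
        out="$out_dir/${sample}${suffix}.fastq"
        printf 'filter_fasta.py -f %s -o %s --sample_id_fp %s\n' "$fastq" "$out" "$ids"
        printf 'gzip %s\n' "$out"
    done
}
```

Running `print_filter_commands sampID fastq/demultiplexed_seqs.fastq fastq .MS28F.388R` lists the commands; once they look right, the output can be piped to `sh` to execute them.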

The first three lines of my manifest looked like:

sample-id,absolute-filepath,direction
AH1.MS28F.388R,$PWD/fastq/AH1.MS28F.388R.fastq.gz,forward
AH2.MS28F.388R,$PWD/fastq/AH2.MS28F.388R.fastq.gz,forward

Given all of that, you should be able to import your data into QIIME 2 as follows:

qiime tools import \
    --input-path manifest_all.csv \
    --output-path demux_all.qza \
    --type 'SampleData[SequencesWithQuality]' \
    --source-format SingleEndFastqManifestPhred33
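
If the import complains, it may be worth ruling out simple manifest problems first. The sketch below is not part of QIIME 2: `check_manifest` is a hypothetical helper that assumes a comma-separated manifest with the `sample-id,absolute-filepath,direction` header, and it treats a leading literal `$PWD` as the current directory.

```shell
# Check a comma-separated manifest: every listed file must exist and every
# direction must be forward or reverse. Prints one line per problem found
# and returns a non-zero status if there was any.
check_manifest() {
    local manifest status header sample path direction rest
    manifest="$1"
    status=0
    [ -f "$manifest" ] || { echo "manifest not found: $manifest"; return 1; }

    {
        IFS= read -r header || true    # skip the header line
        while IFS=, read -r sample path direction || [ -n "$sample" ]; do
            [ -n "$sample" ] || continue          # ignore blank lines
            # Assumption: a leading literal $PWD means the current directory.
            case "$path" in
                '$PWD'/*)
                    rest=${path#\$PWD/}
                    path="$PWD/$rest"
                    ;;
            esac
            if [ ! -f "$path" ]; then
                echo "missing file for $sample: $path"
                status=1
            fi
            case "$direction" in
                forward|reverse) ;;
                *) echo "unexpected direction for $sample: $direction"; status=1 ;;
            esac
        done
    } < "$manifest"
    return "$status"
}
```

Run it from the same directory you will import from, e.g. `check_manifest manifest_all.csv`. No output means every row passed these two checks, which does not guarantee that the import itself will succeed.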

Best of luck, let me know if anything is unclear.

-Kristopher

Awesome post @kparke10!

Given this error, it seems that something went wrong right at the end of converting.

I definitely haven’t seen this before. Could either of you post some sample data that reproduces this? I was pretty sure we were checking for everything that would go wrong in the transformer which converts the data, but clearly we missed something. Off the cuff, it looks like our transformer failed to write any data :frowning: .
