Importing paired-end FASTQ sequences from a single file

cbippert · August 16, 2018, 6:12pm

Hello,

I am trying to import a single fastq.gz file that came from an SRA file downloaded from NCBI. The study says that the data are paired-end.

Here is a sample of the data: test_data.txt (1016 Bytes)

I assume that it has already been demultiplexed, because I can find no barcodes.

Trying to import the sequences using

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path test_mouse1 \
  --output-path test_mouse_fastq.qza

results in

I also tried creating a manifest file: manifest.txt (205 Bytes)

When I tried that route using the command

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path sample1 \
  --output-path paired-end-demux.qza \
  --source-format PairedEndFastqManifestPhred33

I get this:

I think the main issue is the format of the fastq file (the quality scores are split across multiple lines), but I do not know how to fix that, if it is in fact the issue.

Can someone help me out, please?

Also, I have finished the Importing Data tutorial. It was informative, but it did not seem to help.

Thank you for your time,
Clinton

willowblade · August 16, 2018, 10:11pm

Hi Clinton,

I am not part of the QIIME2 team, but I think your issue is that QIIME2 is not set up to handle data that is demultiplexed in the header at import. As such, you will probably need to reformat the data before you can import it. Looking at the sample data, it appears to me that the header records whether a given read is read one or read two. E.g. @DRR089861.1.1 is read one, while @DRR089861.1.2 is read two. I’d need to see a larger segment of the file to work out which part of the header identifies the individual sequence reads, but you should be able to use this information to write code that splits your file into the format QIIME2 needs.
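As a rough sketch of that splitting step in Python: this assumes strict four-line FASTQ records and that the first token of each header ends in .1 for read one or .2 for read two, as in the example IDs above. The function names and output file names are hypothetical, and I have not checked the assumption against the full file.

```python
import gzip
from itertools import islice


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed text file, chosen by file extension."""
    return gzip.open(path, mode) if str(path).endswith(".gz") else open(path, mode)


def split_reads(in_path, fwd_path, rev_path):
    """Write reads whose ID ends in .1 to fwd_path and .2 to rev_path.

    Assumes four lines per record (no wrapped sequence or quality lines).
    Returns (forward_count, reverse_count).
    """
    counts = {"1": 0, "2": 0}
    with open_text(in_path) as src, \
            open_text(fwd_path, "wt") as fwd, \
            open_text(rev_path, "wt") as rev:
        outputs = {"1": fwd, "2": rev}
        while True:
            record = list(islice(src, 4))
            # Stop at end of file, tolerating trailing blank lines.
            if not "".join(record).strip():
                break
            if len(record) < 4 or not record[0].startswith("@"):
                raise ValueError(f"malformed FASTQ record near: {record[0]!r}")
            read_id = record[0].split()[0]      # e.g. "@DRR089861.1.1"
            mate = read_id.rsplit(".", 1)[-1]   # expected to be "1" or "2"
            if mate not in outputs:
                raise ValueError(f"cannot tell read direction from {read_id!r}")
            outputs[mate].writelines(record)
            counts[mate] += 1
    return counts["1"], counts["2"]
```

Calling something like split_reads("reads.fastq.gz", "reads_1.fastq.gz", "reads_2.fastq.gz") should give one file per direction. The two counts it returns should match for properly paired data, so a mismatch is a hint that the header assumption does not hold.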

Good luck!

thermokarst · August 17, 2018, 1:18pm

Hey there @cbippert!

If it is only one sample you are downloading, then yes, it is demuxed. If the file represents more than one sample, then no, it is still multiplexed. Judging from your manifest file, it is in fact just one sample.

So, the first import command you ran failed because the name of the file does not match what QIIME 2 was expecting to see --- this is fine, and in fact, is one of the reasons the more flexible manifest format exists.

On to the manifest. First off, the import command is specifying the wrong type for paired-end reads --- the command should actually say:

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path sample1 \
  --output-path paired-end-demux.qza \
  --source-format PairedEndFastqManifestPhred33

As well, let's look at your manifest:

sample-id,absolute-filepath,direction
test_mouse,/home/qiime2/Desktop/q2-mouse-seqs/test_mouse/DRX083587_1.fastq,forward
test_mouse2,/home/qiime2/Desktop/q2-mouse-seqs/test_mouse/DRX083587_2.fastq,reverse

This doesn't quite conform to the spec listed in the docs --- the sample-id for the second row should be test_mouse --- it is part of the same sample as the row above, just the reverse reads.
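If it helps, here is a small Python sketch for sanity-checking a paired-end manifest before importing. It only checks the rule described above (each sample-id has exactly one forward row and one reverse row) and assumes the three-column comma-separated layout shown. The function name is hypothetical; this is not part of QIIME 2.

```python
import csv
from collections import defaultdict


def check_paired_manifest(manifest_path):
    """Return a list of problems found in a paired-end manifest (empty if none)."""
    directions = defaultdict(list)
    with open(manifest_path, newline="") as handle:
        # Expects the header row: sample-id,absolute-filepath,direction
        for row in csv.DictReader(handle):
            directions[row["sample-id"]].append(row["direction"])
    problems = []
    for sample_id, seen in directions.items():
        if sorted(seen) != ["forward", "reverse"]:
            problems.append(
                f"{sample_id}: expected one forward and one reverse row, found {seen}"
            )
    return problems
```

Run against the manifest above, this reports both test_mouse (forward only) and test_mouse2 (reverse only). Once the second row uses test_mouse as its sample-id, the list comes back empty.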

Hope that helps!

cbippert · August 17, 2018, 9:31pm

Thank you both for your help.

There appeared to be two problems.

1. The default program I was using to open the sequences was not displaying them correctly. I changed my computer's default settings so that the file opened in an Excel-type program. That fixed the format issue that I was having.
2. I used the code provided by Matthew. After fixing my manifest file and fixing some errors in the sequence data, the program ran correctly and generated a QIIME artifact.

Thank you again for your help.

Clinton
