Importing unzipped fastq files using manifest - will zipping increase speed?

Hi,

I am importing 96 paired-end samples (182 fastq files) in manifest format on a university HPC running QIIME 2 version 2019.4, with the command:

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path ../fastq_manifest.tsv \
  --output-path ../02_qiimeimport/paired-end-16S.qza \
  --input-format PairedEndFastqManifestPhred64V2

I imported a subset of samples to get an idea of the time and it is approximately 25 minutes per file. The importing itself is working fine but I am wondering if I can speed up the process.

When I first tried to import the whole dataset, the process ran out of space in tmp. This was fixed by assigning a different tmpdir in datastorage, and the import is now running. However, as the new tmpdir isn't local to the node, I think it will take even longer to run: I can see in the tmpdir that it has taken 24 hours to import about 1/4 of the files.
I understand that it is hard to estimate how long importing takes because sequencing files differ in size, but this seems particularly long (although I do acknowledge that the PHRED quality conversion is quite time consuming).
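
For reference, temporary files are usually redirected through the TMPDIR environment variable, which standard tools such as mktemp and Python's tempfile module honour; since QIIME 2 is written in Python, it most likely follows the same setting. A minimal sketch, assuming a hypothetical scratch directory (./scratch_tmp here stands in for the real datastorage path):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch location; substitute the real datastorage path.
SCRATCH_DIR="${SCRATCH_DIR:-$PWD/scratch_tmp}"
mkdir -p "$SCRATCH_DIR"

# Point temporary-file creation at the scratch directory for this shell
# and for any command launched from it.
export TMPDIR="$SCRATCH_DIR"

# Show how much space is free there before starting a long import.
df -h "$TMPDIR"

# Sanity check: a new temporary file should land inside TMPDIR.
probe="$(mktemp)"
echo "temporary files are created under: $(dirname "$probe")"
rm -f "$probe"
```

Checking free space first gives an early warning if the new location is also too small for the whole dataset.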

The files are in plain fastq format (i.e. they are not gzipped) and I wonder if this is contributing to the time the import is taking. If I gzipped the data myself before importing, could this save time?
I have read that importing this way can take either .fastq or .fastq.gz and will gzip the imported data if it is not already so I'm not sure if zipping it first would save time.
My thinking is if the command first has to copy the data in and then work on zipping it, would it instead be faster to zip it first so that the large unzipped files don't have to be copied? And is part of the reason that I ran out of space in tmp because they are unzipped?

So in short, does importing unzipped data and allowing qiime to zip it take more time or the same amount of time as the user first zipping and then importing (perhaps due to the burden of copying large unzipped folders into tmp)?

Ultimately I could try it out both ways myself to compare, but wondered if anyone could shed light on the internal workings of import to help me understand. Thanks!

Hello @niamh55,

Welcome back to the forums! :wave:

This is a great question! I think the answer is maybe: gzipping first could help if your slowdown is caused by file read speed. If it’s limited by write speed or the Phred score conversion, then no.

I’m not sure how much time this will save, but you could try another tool for the ASCII offset conversion (Phred+64 to Phred+33). Say vsearch:

vsearch --fastq_convert sample1.fastq \
  --fastq_ascii 64 --fastq_asciiout 33 \
  --fastqout sample1_33.fastq

Or even better, you could compress your output! :clamp:

vsearch --fastq_convert sample1.fastq \
  --fastq_ascii 64 --fastq_asciiout 33 \
  --fastqout - | pigz -p 8 > sample1_33.fastq.gz
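
With many files, the same pipeline can be wrapped in a loop. This is only a sketch: the function name convert_dir, the directory arguments and the default of 8 threads are illustrative assumptions, and it expects every input file to end in .fastq and to be Phred+64 encoded.

```shell
#!/usr/bin/env bash
set -euo pipefail

# convert_dir IN_DIR OUT_DIR [THREADS]
# Re-encode every *.fastq in IN_DIR from Phred+64 to Phred+33 and write it
# gzip-compressed to OUT_DIR. Names and the thread count are illustrative.
convert_dir() {
  local in_dir="$1" out_dir="$2" threads="${3:-8}"
  local fq base
  mkdir -p "$out_dir"
  for fq in "$in_dir"/*.fastq; do
    [ -e "$fq" ] || continue            # skip if the glob matched nothing
    base="$(basename "$fq" .fastq)"
    # Same pipeline as above: convert the quality offset, then compress.
    vsearch --fastq_convert "$fq" \
      --fastq_ascii 64 --fastq_asciiout 33 \
      --fastqout - \
      | pigz -p "$threads" > "$out_dir/${base}.fastq.gz"
  done
}
```

Calling convert_dir ../fastq ../fastq_phred33 8 would then leave one .fastq.gz per input file in ../fastq_phred33 (both paths are placeholders).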

Colin

Thank you very much @colinbrislawn!

Just compressing the files had only a slight impact; however, compressing and converting with vsearch and then importing into QIIME 2 offered a decent improvement.
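
For anyone repeating this: once the files have been re-encoded, the manifest has to point at the new .fastq.gz files, and the import should use PairedEndFastqManifestPhred33V2 instead of the Phred64 format. A rough sketch of the manifest update, assuming a three-column tab-separated manifest, converted files that keep their base names, and a hypothetical helper called rewrite_manifest:

```shell
#!/usr/bin/env bash
set -euo pipefail

# rewrite_manifest OLD_MANIFEST NEW_DIR > NEW_MANIFEST
# Keep the header and sample IDs, but replace each file path with
# NEW_DIR/<file name>.gz. Assumes original paths end in .fastq and that
# NEW_DIR is an absolute path, as the manifest format expects.
rewrite_manifest() {
  local manifest="$1" new_dir="$2"
  awk -F'\t' -v OFS='\t' -v dir="$new_dir" '
    function repoint(path,    n, parts) {
      n = split(path, parts, "/")              # parts[n] is the file name
      sub(/\.fastq$/, ".fastq.gz", parts[n])
      return dir "/" parts[n]
    }
    NR == 1 { print; next }                    # header line is kept as-is
    { print $1, repoint($2), repoint($3) }
  ' "$manifest"
}

# Example (paths are placeholders):
#   rewrite_manifest ../fastq_manifest.tsv /abs/path/fastq_phred33 \
#     > ../fastq_manifest_phred33.tsv
#   qiime tools import \
#     --type 'SampleData[PairedEndSequencesWithQuality]' \
#     --input-path ../fastq_manifest_phred33.tsv \
#     --output-path ../02_qiimeimport/paired-end-16S.qza \
#     --input-format PairedEndFastqManifestPhred33V2
```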
