I am importing 96 paired-end samples (182 fastq files) in manifest format on a university HPC running QIIME 2 version 2019.4, with the command:
qiime tools import
I imported a subset of samples to get an idea of the time, and it takes approximately 25 minutes per file. The import itself is working fine, but I am wondering if I can speed up the process.
When I first tried to import the whole dataset, the process ran out of space in tmp (this was fixed by assigning a different tmpdir on datastorage) and it is now running. However, as the new tmpdir isn’t local to the node, I think it will take even longer to run: I can see in the tmpdir that it has taken 24 hours to import about a quarter of the files.
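To sanity-check the tmp space issue, I could compare the total size of my input files with the free space in the temp directory before starting. This is only a rough sketch: the function names are made up for illustration, and it assumes the import needs about as much temp space as the input files occupy, which is part of what I am asking.

```python
import shutil
import tempfile
from pathlib import Path


def total_size_bytes(fastq_dir, patterns=("*.fastq", "*.fastq.gz")):
    """Sum the on-disk size of the FASTQ files directly inside fastq_dir."""
    root = Path(fastq_dir)
    return sum(p.stat().st_size for pat in patterns for p in root.glob(pat))


def fits_in_tmp(fastq_dir, tmp_dir=None):
    """Return (needed, free, fits) for the inputs versus the temp directory.

    Assumption: temp usage is roughly the size of the input files.
    """
    # tempfile.gettempdir() honours the TMPDIR environment variable.
    tmp_dir = tmp_dir or tempfile.gettempdir()
    needed = total_size_bytes(fastq_dir)
    free = shutil.disk_usage(tmp_dir).free
    return needed, free, needed <= free
```

Calling `fits_in_tmp("path/to/fastqs")` (a placeholder path) on the node would show whether the local tmp was ever large enough for the unzipped files.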
I understand that it is hard to estimate how long importing takes because sequencing files differ in size; however, this seems particularly long (though I do acknowledge that the PHRED quality conversion is quite time consuming).
The files are in fastq format (i.e., they are not gzipped) and I wonder if this is contributing to the time this is taking. If I gzipped the data myself before importing, could this save time?
I have read that importing this way can take either .fastq or .fastq.gz and will gzip the imported data if it is not already gzipped, so I’m not sure if zipping it first would save time.
My thinking is: if the command first has to copy the data in and then work on zipping it, would it be faster to zip it first so that the large unzipped files don’t have to be copied? And did I run out of space in tmp partly because the files are unzipped?
So in short, does importing unzipped data and allowing qiime to zip it take more time or the same amount of time as the user first zipping and then importing (perhaps due to the burden of copying large unzipped folders into tmp)?
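For the "zip first" option, the pre-zipping step I have in mind would look roughly like the sketch below. It uses only the Python standard library; `gzip_one`, `gzip_all`, the compression level, and the worker count are my own placeholder choices, not anything taken from QIIME 2.

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def gzip_one(src, out_dir, level=6):
    """Compress one .fastq file to out_dir/<name>.fastq.gz, keeping the original."""
    src = Path(src)
    dst = Path(out_dir) / (src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb", compresslevel=level) as fout:
        # Stream in 1 MiB chunks so a large file is never held in memory.
        shutil.copyfileobj(fin, fout, length=1024 * 1024)
    return dst


def gzip_all(fastq_dir, out_dir, workers=4):
    """Compress every *.fastq directly inside fastq_dir; return the new paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    files = sorted(Path(fastq_dir).glob("*.fastq"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: gzip_one(f, out), files))
```

The manifest would then need to point at the new .fastq.gz paths instead of the original .fastq files.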
Ultimately I could try it out both ways myself to compare, but wondered if anyone could shed light on the internal workings of import to help me understand. Thanks!