BadZipFile: how to find the corrupted file?

Hi,

I'm experiencing an issue similar to the ones in this and this post: corrupted files.

I can import my .fastq.gz files in qiime, but when I run

    qiime demux summarize \
      --i-data raw_reads_.qza \
      --o-visualization raw_reads_SSD.qzv

I get this error:

Traceback (most recent call last):
File "/home/thebiobeast/miniforge3/envs/qiime2-amplicon-2024.2/lib/python3.8/site-packages/q2cli/util.py", line 492, in _load_input_file
artifact = qiime2.sdk.Result.load(fp)
File "/home/thebiobeast/miniforge3/envs/qiime2-amplicon-2024.2/lib/python3.8/site-packages/qiime2/sdk/result.py", line 80, in load
archiver = archive.Archiver.load(filepath)
File "/home/thebiobeast/miniforge3/envs/qiime2-amplicon-2024.2/lib/python3.8/site-packages/qiime2/core/archive/archiver.py", line 367, in load
archive.mount(path)
File "/home/thebiobeast/miniforge3/envs/qiime2-amplicon-2024.2/lib/python3.8/site-packages/qiime2/core/archive/archiver.py", line 198, in mount
self.extract(filepath)
File "/home/thebiobeast/miniforge3/envs/qiime2-amplicon-2024.2/lib/python3.8/site-packages/qiime2/core/archive/archiver.py", line 212, in extract
zf.extract(name, path=str(filepath.parent))
File "/home/thebiobeast/miniforge3/envs/qiime2-amplicon-2024.2/lib/python3.8/zipfile.py", line 1630, in extract
return self._extract_member(member, path, pwd)
File "/home/thebiobeast/miniforge3/envs/qiime2-amplicon-2024.2/lib/python3.8/zipfile.py", line 1702, in _extract_member
shutil.copyfileobj(source, target)
File "/home/thebiobeast/miniforge3/envs/qiime2-amplicon-2024.2/lib/python3.8/shutil.py", line 205, in copyfileobj
buf = fsrc_read(length)
File "/home/thebiobeast/miniforge3/envs/qiime2-amplicon-2024.2/lib/python3.8/zipfile.py", line 940, in read
data = self._read1(n)
File "/home/thebiobeast/miniforge3/envs/qiime2-amplicon-2024.2/lib/python3.8/zipfile.py", line 1030, in _read1
self._update_crc(data)
File "/home/thebiobeast/miniforge3/envs/qiime2-amplicon-2024.2/lib/python3.8/zipfile.py", line 958, in _update_crc
raise BadZipFile("Bad CRC-32 for file %r" % self.name)
zipfile.BadZipFile: Bad CRC-32 for file 'ce11c3ae-864f-49af-a884-274406195176/data/804_802_L001_R1_001.fastq.gz'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/thebiobeast/miniforge3/envs/qiime2-amplicon-2024.2/lib/python3.8/site-packages/q2cli/click/type.py", line 116, in _convert_input
result, error = q2cli.util._load_input(value)
File "/home/thebiobeast/miniforge3/envs/qiime2-amplicon-2024.2/lib/python3.8/site-packages/q2cli/util.py", line 397, in _load_input
artifact, error = _load_input_file(fp)
File "/home/thebiobeast/miniforge3/envs/qiime2-amplicon-2024.2/lib/python3.8/site-packages/q2cli/util.py", line 498, in _load_input_file
raise ValueError(
ValueError: It looks like you have an Artifact but are missing the plugin(s) necessary to load it. Artifact has type 'SampleData[PairedEndSequencesWithQuality]' and format 'SingleLanePerSamplePairedEndFastqDirFmt'

There was a problem loading 'raw_reads_SSD.qza' as an artifact:

It looks like you have an Artifact but are missing the plugin(s) necessary to load it. Artifact has type 'SampleData[PairedEndSequencesWithQuality]' and format 'SingleLanePerSamplePairedEndFastqDirFmt'

It appears that some of the .fastq.gz files are corrupted.

I ran unzip -t raw_reads_.qza, and indeed it reports a bad CRC twice:

ce11c3ae-864f-49af-a884-274406195176/data/616_1498_L001_R2_001.fastq.gz bad CRC b506cd09 (should be 43f5c748)
&
ce11c3ae-864f-49af-a884-274406195176/data/804_802_L001_R1_001.fastq.gz bad CRC 7f231ee9 (should be 3fce1e29)

All other ~1700 lines were OK.
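
For reference, the same kind of check can be sketched in Python with the standard-library zipfile module. This is only a minimal sketch and not part of QIIME 2; the function name find_bad_members is made up for illustration. Unlike ZipFile.testzip(), which returns only the first bad member, it reads every member and collects all that fail.

```python
import zipfile
import zlib


def find_bad_members(archive_path):
    """Return (name, stored CRC-32, error message) for every unreadable member.

    Hypothetical helper: it reads each member in full so that zipfile's own
    CRC-32 check runs, and records every failure instead of stopping at the
    first one.
    """
    bad = []
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            try:
                with zf.open(info) as member:
                    # Read in 1 MiB chunks; the CRC check fires at end of stream.
                    while member.read(1 << 20):
                        pass
            except (zipfile.BadZipFile, zlib.error, EOFError) as exc:
                bad.append((info.filename, f"{info.CRC:08x}", str(exc)))
    return bad


if __name__ == "__main__":
    # "raw_reads_.qza" is the artifact name used in this thread.
    for name, stored_crc, message in find_bad_members("raw_reads_.qza"):
        print(f"{name}\tstored CRC {stored_crc}\t{message}")
```

The stored CRC printed here corresponds to the "should be" value in the unzip output.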

However, these file names (e.g. 616_1498_L001_R2_001.fastq.gz) are not the file names I actually have.

How can I find out which of my own files correspond to these corrupted .fastq.gz files, so I can remove them?
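
One possible way to map the archive names back to the original files is to compare checksums instead of names. The "should be" value that unzip prints is the CRC-32 recorded in the archive when the member was written, so an original file with the same CRC-32 is a likely match. The sketch below rests on two assumptions that are not confirmed anywhere in this thread: that the import stored the .fastq.gz bytes unchanged, and that the original files on disk are themselves intact. If nothing matches, the result is simply inconclusive. The function names and the directory name "original_reads" are hypothetical.

```python
import zipfile
import zlib
from pathlib import Path


def file_crc32(path, chunk_size=1 << 20):
    """Compute the CRC-32 of a file's raw bytes without loading it all at once."""
    crc = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def match_members_to_originals(archive_path, member_names, originals_dir):
    """Map each named archive member to original files with the same CRC-32.

    Assumption: the archive holds byte-for-byte copies of the originals.
    """
    with zipfile.ZipFile(archive_path) as zf:
        # CRC-32 values recorded in the archive's directory for these members.
        expected = {zf.getinfo(name).CRC: name for name in member_names}
    matches = {}
    for path in sorted(Path(originals_dir).rglob("*.fastq.gz")):
        crc = file_crc32(path)
        if crc in expected:
            matches.setdefault(expected[crc], []).append(str(path))
    return matches


if __name__ == "__main__":
    members = [
        "ce11c3ae-864f-49af-a884-274406195176/data/616_1498_L001_R2_001.fastq.gz",
        "ce11c3ae-864f-49af-a884-274406195176/data/804_802_L001_R1_001.fastq.gz",
    ]
    # "original_reads" is a placeholder for the directory that was imported.
    result = match_members_to_originals("raw_reads_.qza", members, "original_reads")
    for member, paths in result.items():
        print(member, "->", paths)
```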

Hi @Rob_DNA,

Good use of unzip here! Regarding the file names: this really shouldn't be possible. Could you provide some additional context on your import command, and maybe on where these files come from in general?
An ls of the original directory would be perfect if you wouldn't mind sharing that.

Hi Evan,

Thanks for the quick reply. I'm currently redownloading the data using wget instead of just clicking a download link. Perhaps something went wrong with the browser download of the data and it is an easy fix. I'll let you know how that turns out. If I still get the errors, I'll post the information you requested.
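
To check whether a download left any .fastq.gz file damaged before importing again, each file can be decompressed in full and any failure recorded. The following is a minimal sketch using only the Python standard library; the function names and the directory name "downloaded_reads" are hypothetical. It tests only the gzip container (truncation, bad checksum), not whether the FASTQ content itself is valid.

```python
import gzip
import zlib
from pathlib import Path


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(path, "rb") as fh:
            # Reading to the end triggers gzip's CRC and length checks.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError also covers gzip.BadGzipFile and unreadable files;
        # EOFError is raised for truncated streams.
        return False
    return True


def corrupted_gzip_files(directory):
    """List every *.fastq.gz under `directory` that fails the integrity check."""
    return [
        str(path)
        for path in sorted(Path(directory).rglob("*.fastq.gz"))
        if not gzip_is_intact(path)
    ]


if __name__ == "__main__":
    # "downloaded_reads" is a placeholder for the download directory.
    for bad_file in corrupted_gzip_files("downloaded_reads"):
        print("corrupted:", bad_file)
```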

Regarding the file names, I know it sounds strange, but it is really the case. Somebody else reports a similar thing:

Any updates on this @Rob_DNA?

We don't do any header parsing for importing demux data, so I really don't have any explanation for the extra files without additional information on the import command and list of files implicated.

@ebolyen apologies for the late reply. 1) I have been sick with the flu, and 2) I want to redownload the data, but the download link is broken (the peer closes the connection). I contacted the data center that provides the download links, but the person responsible is/was also sick. I haven't been able to download the data again, so I have no update yet.

This is interesting, as it does appear to happen!

I'll give you an update after I have redownloaded the data.
