skbio.io._exception.FASTQFormatError: Found blank or whitespace-only line before ‘+’ in FASTQ file

mbnalbright · July 11, 2018, 8:00pm

Thanks for the rapid reply. I was trying to use an already existing OTU table and representative sequences that had been generated in another program, so that makes sense.

I have been trying to run my data through QIIME2 from the beginning (demultiplexing) now but I am having other issues.

I have paired-end sequences, and I can run the full paired-end pipeline on the demultiplexed reads. However, I lose most of my data at the read-joining step because the reverse reads have very poor quality. So I have been trying to run the pipeline using only the forward reads, but I keep getting the error below, and I am not sure what is generating it or how to troubleshoot it. I exported and inspected some of the files generated during demultiplexing, and they look okay.

Plugin error from vsearch:

Found blank or whitespace-only line before '+' in FASTQ file

Traceback (most recent call last):
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/q2cli/commands.py", line 246, in __call__
results = action(**arguments)
File "", line 2, in dereplicate_sequences
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/qiime2/sdk/action.py", line 222, in bound_callable
spec.view_type, recorder)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/qiime2/sdk/result.py", line 261, in _view
result = transformation(self._archiver.data_dir)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/qiime2/core/transform.py", line 59, in transformation
new_view = transformer(view)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/q2_types/per_sample_sequences/_transformer.py", line 371, in _12
for seq in fq_reader:
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/skbio/io/registry.py", line 506, in
return (x for x in itertools.chain([next(gen)], gen))
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/skbio/io/registry.py", line 531, in _read_gen
yield from reader(file, **kwargs)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/skbio/io/registry.py", line 1008, in wrapped_reader
yield from reader_function(fhs[-1], **kwargs)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/skbio/io/format/fastq.py", line 344, in _fastq_to_generator
seq, qual_header = _parse_sequence_data(fh, seq_header)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/skbio/io/format/fastq.py", line 481, in _parse_sequence_data
_blank_error("before '+'")
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/skbio/io/format/fastq.py", line 473, in _blank_error
raise FASTQFormatError(error_string)
skbio.io._exception.FASTQFormatError: Found blank or whitespace-only line before '+' in FASTQ file

I tried to upload the demultiplexed .qza file, but it was too large.

SINGLE END VERSION

qiime tools import \
  --type MultiplexedSingleEndBarcodeInSequence \
  --input-path SE/ \
  --output-path multiplexed-seqs_RUN6_forward.qza

multiplexed-seqs_RUN6_forward.qza

#REannotate pine data for PICRUST

qiime cutadapt demux-single \
  --i-seqs multiplexed-seqs_RUN6_forward.qza \
  --m-barcodes-file RUN6_mapping.tsv \
  --m-barcodes-column BarcodeSequence \
  --p-error-rate 0 \
  --o-per-sample-sequences demultiplexed-seqs6BAC_Forward.qza \
  --o-untrimmed-sequences untrimmed.qza \
  --verbose

qiime quality-filter q-score \
  --i-demux demultiplexed-seqs6BAC_Forward.qza \
  --o-filtered-sequences Filtered_BAC6_forward.qza \
  --o-filter-stats demux-joined-filter-statsBac6_forward.qza

qiime vsearch dereplicate-sequences \
  --i-sequences Filtered_BAC6_forward.qza \
  --o-dereplicated-table Table6_forward.qza \
  --o-dereplicated-sequences Rep-seqs6_forward.qza

Nicholas_Bokulich · July 11, 2018, 8:09pm

It sounds like one of your records is missing the sequence line!

I wonder if qiime quality-filter q-score is dropping one of your sequences because its quality is too low. (This seems unlikely, but I am keeping an open mind.)

Could you please share the output of the following command:

qiime demux summarize \
  --i-data demultiplexed-seqs6BAC_Forward.qza \
  --o-visualization demux-summary.qzv

Could you send a snippet of the file? E.g., the output of:

head -n 20 sequences.fastq

mbnalbright · July 11, 2018, 9:37pm

Attached are two files: (1) demux-summary.qzv and (2) a piece of one of the demultiplexed fastq files.

Also, I checked for blank lines with grep -c "^$", and all of my files had them. The list of files below shows the number of blank lines found in each.

Thanks!

100A_bacteriaseqs_CTACAGGGCAAG_L001_R1_001.fastq.gz:8131
101C_bacteriaseqs_ATAGTATTGAGC_L001_R1_001.fastq.gz:6439
103A_bacteriaseqs_CACCTCTTGAGC_L001_R1_001.fastq.gz:478
104A_bacteriaseqs_GTCATCTTGAGC_L001_R1_001.fastq.gz:22
105B_bacteriaseqs_TTCTAGTTGAGC_L001_R1_001.fastq.gz:146
106B_bacteriaseqs_TTCTAGCGTGAT_L001_R1_001.fastq.gz:3528
109A_bacteriaseqs_AATGAGGAAATG_L001_R1_001.fastq.gz:1567
110A_bacteriaseqs_CACCTCGAAATG_L001_R1_001.fastq.gz:121
111C_bacteriaseqs_ATAGTACGTGAT_L001_R1_001.fastq.gz:823
113C_bacteriaseqs_TCGCCACGTGAT_L001_R1_001.fastq.gz:390
114A_bacteriaseqs_GTCATCCGTGAT_L001_R1_001.fastq.gz:7
115C_bacteriaseqs_TGCCCGCGTGAT_L001_R1_001.fastq.gz:87
116C_bacteriaseqs_ATAGTAGAAATG_L001_R1_001.fastq.gz:1157
117C_bacteriaseqs_TCGCCAGAAATG_L001_R1_001.fastq.gz:135
118C_bacteriaseqs_CGAGGCGAAATG_L001_R1_001.fastq.gz:249
120A_bacteriaseqs_GTCATCGAAATG_L001_R1_001.fastq.gz:108
121B_bacteriaseqs_TTCTAGGAAATG_L001_R1_001.fastq.gz:144
122C_bacteriaseqs_TGCCCGGAAATG_L001_R1_001.fastq.gz:49026
19B_bacteriaseqs_ACTGGCCCGAGG_L001_R1_001.fastq.gz:795
21A_bacteriaseqs_AATGAGCCGAGG_L001_R1_001.fastq.gz:1777
22A_bacteriaseqs_CACCTCCCGAGG_L001_R1_001.fastq.gz:1584
25A_bacteriaseqs_GTCATCCCGAGG_L001_R1_001.fastq.gz:572
26B_bacteriaseqs_CCGCATCCGAGG_L001_R1_001.fastq.gz:245
28A_bacteriaseqs_CTACAGCCGAGG_L001_R1_001.fastq.gz:199
29A_bacteriaseqs_AATGAGCGTCTA_L001_R1_001.fastq.gz:6386
31A_bacteriaseqs_CACCTCCGTCTA_L001_R1_001.fastq.gz:447
32B_bacteriaseqs_TACACACCGAGG_L001_R1_001.fastq.gz:202
34B_bacteriaseqs_TTCTAGCCGAGG_L001_R1_001.fastq.gz:498
35B_bacteriaseqs_ACTGGCCGTCTA_L001_R1_001.fastq.gz:137
37B_bacteriaseqs_CCGCATCGTCTA_L001_R1_001.fastq.gz:510
38A_bacteriaseqs_GTCATCCGTCTA_L001_R1_001.fastq.gz:715
41C_bacteriaseqs_TGCCCGCGTCTA_L001_R1_001.fastq.gz:37
42B_bacteriaseqs_TACACACGTCTA_L001_R1_001.fastq.gz:68
43B_bacteriaseqs_TTCTAGCGTCTA_L001_R1_001.fastq.gz:54
44A_bacteriaseqs_AATGAGCTTTGC_L001_R1_001.fastq.gz:2355
46A_bacteriaseqs_CACCTCCTTTGC_L001_R1_001.fastq.gz:464
48B_bacteriaseqs_ACTGGCCTTTGC_L001_R1_001.fastq.gz:256
49B_bacteriaseqs_CCGCATCTTTGC_L001_R1_001.fastq.gz:11976
51B_bacteriaseqs_TACACACTTTGC_L001_R1_001.fastq.gz:58
52B_bacteriaseqs_TTCTAGCTTTGC_L001_R1_001.fastq.gz:33
54B_bacteriaseqs_ACTGGCGCAGAT_L001_R1_001.fastq.gz:671
56B_bacteriaseqs_CCGCATGCAGAT_L001_R1_001.fastq.gz:716
57C_bacteriaseqs_CGAGGCGCAGAT_L001_R1_001.fastq.gz:78
59B_bacteriaseqs_TTCTAGGCAGAT_L001_R1_001.fastq.gz:30
64B_bacteriaseqs_ACTGGCGGCAAG_L001_R1_001.fastq.gz:1031
70B_bacteriaseqs_CCGCATGGCAAG_L001_R1_001.fastq.gz:649
72A_bacteriaseqs_GTCATCCTTTGC_L001_R1_001.fastq.gz:27
77B_bacteriaseqs_TACACAGGCAAG_L001_R1_001.fastq.gz:4251
78B_bacteriaseqs_TTCTAGGGCAAG_L001_R1_001.fastq.gz:2371
79A_bacteriaseqs_CTACAGCTTTGC_L001_R1_001.fastq.gz:87
80A_bacteriaseqs_AATGAGGCAGAT_L001_R1_001.fastq.gz:651
81A_bacteriaseqs_CACCTCGCAGAT_L001_R1_001.fastq.gz:552
82A_bacteriaseqs_GTCATCGCAGAT_L001_R1_001.fastq.gz:119
83A_bacteriaseqs_CTACAGGCAGAT_L001_R1_001.fastq.gz:124
85B_bacteriaseqs_ACTGGCTTGAGC_L001_R1_001.fastq.gz:1296
88B_bacteriaseqs_CCGCATTTGAGC_L001_R1_001.fastq.gz:1483
89B_bacteriaseqs_TACACATTGAGC_L001_R1_001.fastq.gz:1290
90A_bacteriaseqs_CTACAGTTGAGC_L001_R1_001.fastq.gz:68
91B_bacteriaseqs_ACTGGCCGTGAT_L001_R1_001.fastq.gz:4234
92B_bacteriaseqs_CCGCATCGTGAT_L001_R1_001.fastq.gz:1034
96A_bacteriaseqs_AATGAGGGCAAG_L001_R1_001.fastq.gz:2315

64_B_snippet.fastq (3.4 KB)
demux-summary.qzv (286.5 KB)

Nicholas_Bokulich · July 12, 2018, 7:06pm

That seems very strange. Were these data pre-processed in any way, or is this the raw data hot off the sequencer?

Needless to say, there should not be blank lines in there, particularly missing sequences!

But that seems like it is probably the issue here. Is there a rawer form of the data you could use?

Why this issue is not being caught during import is also troubling — could you please either send the entire file or try and find a snippet that shows that there really are blank lines in the sequence? You could do something like the following to see the lines before and after the blank lines:

grep -B 1 -A 2 "^$" sequences.fastq | grep -v "^--$" > empty_seqs.fastq

(Incidentally, you could probably also adapt that command to filter out fastq entries with blank sequences, but I would be very cautious, and you will probably only be able to process the data as single-end thereafter.)
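
For anyone who would rather script that filtering step than adapt the grep command, here is a minimal Python sketch. It assumes strict four-line FASTQ records (no wrapped sequence lines) and single-end data. The function names are hypothetical and are not part of QIIME 2 or skbio.

```python
import gzip
from itertools import islice


def fastq_records(lines):
    """Yield (header, seq, plus, qual) tuples from an iterable of FASTQ lines.

    Assumes strict four-line records (no wrapped sequences).
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record at end of input")
        yield tuple(record)


def non_empty_records(lines):
    """Yield only the records whose sequence line is not blank."""
    for header, seq, plus, qual in fastq_records(lines):
        if seq.strip():
            yield header, seq, plus, qual


def filter_fastq_gz(in_path, out_path):
    """Write a copy of a gzipped FASTQ without empty-sequence records."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for record in non_empty_records(src):
            dst.write("\n".join(record) + "\n")
```

Dropping records this way breaks the one-to-one pairing between forward and reverse files, which is why it only suits single-end processing.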

An easier way to bypass all of this might be to use dada2 or deblur for denoising, rather than using vsearch for OTU picking. Sequences shorter than the truncation length you set will be dropped, so it may just get rid of all your problems without raising an error (and yield better data than OTUs in the process).

Good luck!

mbnalbright · July 12, 2018, 9:32pm

I have also tried to run deblur and I get an error message, which is pasted below.

qiime deblur denoise-16S \
  --i-demultiplexed-seqs demultiplexed-seqs6BAC_Forward.qza \
  --p-trim-length 100 \
  --p-sample-stats \
  --o-representative-sequences T6_rep-seqs.qza \
  --o-table T6_table.qza \
  --o-stats T6_deblur-stats.qza

Traceback (most recent call last):
File "/home/malbright/anaconda3/envs/qiime2-2018.2/bin/deblur", line 4, in
__import__('pkg_resources').run_script('deblur==1.0.3', 'deblur')
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/pkg_resources/__init__.py", line 750, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1527, in run_script
exec(code, namespace, namespace)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/deblur-1.0.3-py3.5.egg-info/scripts/deblur", line 684, in
deblur_cmds()
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/deblur-1.0.3-py3.5.egg-info/scripts/deblur", line 632, in workflow
threads_per_sample=threads_per_sample)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/deblur/workflow.py", line 832, in launch_workflow
left_trim_len=left_trim_length):
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/deblur/workflow.py", line 130, in trim_seqs
for label, seq in input_seqs:
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/deblur/workflow.py", line 99, in sequence_generator
for record in skbio.read(input_fp, format=format, **kw):
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/skbio/io/registry.py", line 506, in
return (x for x in itertools.chain([next(gen)], gen))
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/skbio/io/registry.py", line 531, in _read_gen
yield from reader(file, **kwargs)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/skbio/io/registry.py", line 1008, in wrapped_reader
yield from reader_function(fhs[-1], **kwargs)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/skbio/io/format/fastq.py", line 344, in _fastq_to_generator
seq, qual_header = _parse_sequence_data(fh, seq_header)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/skbio/io/format/fastq.py", line 481, in _parse_sequence_data
_blank_error("before '+'")
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/skbio/io/format/fastq.py", line 473, in _blank_error
raise FASTQFormatError(error_string)
skbio.io._exception.FASTQFormatError: Found blank or whitespace-only line before '+' in FASTQ file
Traceback (most recent call last):
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/q2cli/commands.py", line 246, in __call__
results = action(**arguments)
File "", line 2, in denoise_16S
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/qiime2/sdk/action.py", line 228, in bound_callable
output_types, provenance)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/qiime2/sdk/action.py", line 363, in callable_executor
output_views = self._callable(**view_args)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/q2_deblur/_denoise.py", line 96, in denoise_16S
hashed_feature_ids=hashed_feature_ids)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/q2_deblur/_denoise.py", line 163, in _denoise_helper
subprocess.run(cmd, check=True)
File "/home/malbright/anaconda3/envs/qiime2-2018.2/lib/python3.5/subprocess.py", line 398, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['deblur', 'workflow', '--seqs-fp', '/tmp/qiime2-archive-lf7gz2yt/e3ca7c3f-5578-42f4-b354-77f1959cbbbe/data', '--output-dir', '/tmp/tmpt1k4kr5c', '--mean-error', '0.005', '--indel-prob', '0.01', '--indel-max', '3', '--trim-length', '100', '--min-reads', '10', '--min-size', '2', '--jobs-to-start', '1', '-w', '--keep-tmp-files']' returned non-zero exit status 1

Nicholas_Bokulich · July 13, 2018, 11:40pm

Looks like deblur is running into the same error (makes sense, since it also uses skbio for parsing fastq sequences).

You could try dada2 but it seems like there may be no sneaky workaround — you will need to remove the empty sequence records from your fastq. The easiest way to do this is probably to get the raw sequences, if that's not what you are using already (make sure the sequencing center does not do some kind of pre-processing).

Could you please still run the above command to confirm that there really are empty sequences?

Thanks!

mbnalbright · July 18, 2018, 2:21pm

The grep command that you sent is not coming up with any blank lines... so maybe there aren't any? Unfortunately, the complete files are too large to attach in this forum. I also tested to see if there were blank lines in the original multiplexed form, and there were not any. Any ideas on what to test next?

We use a double-barcoding method (6 bp on each end), so the data were preprocessed to place both barcodes up front (12 bp total), in a format that can be demultiplexed with standard programs.

Thanks, Michaeline

Nicholas_Bokulich · July 18, 2018, 4:02pm

Well that's weird. If grep -c "^$" works, the command I listed should do the same thing but save out the complete fastq entries to a file. Are you using the same input files?

Oh, it looks like maybe you ran grep -c "^$" on gzipped files. Could you unzip a few and give it a try?
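
Running grep -c "^$" directly on a .fastq.gz file counts empty "lines" in the compressed bytes rather than in the FASTQ text, so the counts can be misleading. A small Python sketch of the same check that decompresses first is shown here. The function names are hypothetical, and unlike grep "^$" it also counts whitespace-only lines, to match the wording of the skbio error.

```python
import gzip


def count_blank_lines(lines):
    """Count blank or whitespace-only lines in an iterable of text lines."""
    return sum(1 for line in lines if not line.strip())


def count_blank_lines_in_file(path):
    """Count blank lines in a FASTQ file, decompressing .gz files first."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_blank_lines(handle)
```

A count of zero on the decompressed text would suggest that the blank lines only appear later, in files produced by a downstream step.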

It's not too important — the error is pretty clear — I just wanted a good test case to use on this end, but I can synthesize one. (there are two issues afoot: empty sequences in your files are preventing you from progressing; but QIIME 2 should ideally be detecting these issues during import)

I expect that this preprocessing step is where the problem is creeping in. Somehow blank lines are being inserted, and this is probably where it is happening.

Thank you for that detail, though! This gives us something to work from — I believe you should be able to use qiime1's extract_barcodes.py to do this same thing; might be worth giving that a try instead to see if it fixes this issue if that's not what you are already using.

We have yet to provide support for dual-indexing in QIIME 2 but it is on our radar, so in the future this should not be so troubling. Sorry we don't have this in place already!

mbnalbright · July 18, 2018, 9:06pm

I reran the grep command on multiple unzipped files, and I did not find any blank lines in the files....
I also went back to the raw data from the sequencing facility and used qiime1's extract_barcodes.py to create new files. I then put those files into the QIIME2 pipeline, and I am running into another error.

#ERROR
Plugin error from cutadapt:

/tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-6zm8bwl4/97B_bacteriaseqs_TACACACGTGAT_L001_R1_001.fastq.gz is not a(n) FastqGzFormat file:

Missing sequence for record beginning on line 5

#Processing pipeline

extract_barcodes.py -f R1.fastq -r R2.fastq -c barcode_paired_end --bc1_len 6 --bc2_len 6 -o processed_seqs

qiime tools import \
  --type MultiplexedPairedEndBarcodeInSequence \
  --input-path PE/ \
  --output-path multiplexed-seqs6_Q1.qza

qiime cutadapt demux-paired \
  --i-seqs multiplexed-seqs6_Q1.qza \
  --m-forward-barcodes-file RUN6_mapping.tsv \
  --m-forward-barcodes-column BarcodeSequence \
  --p-error-rate 0 \
  --o-per-sample-sequences demultiplexed-seqs6BAC_Q1.qza \
  --o-untrimmed-sequences untrimmed.qza \
  --verbose

ERROR here...

Am I missing a step or still putting in data in the wrong format?

thermokarst · July 19, 2018, 1:25pm

Hey there @mbnalbright!

No, I don't think you are missing a step. The steps you listed look good to me.

The problem is that at least one of your reads in sample 97B_bacteriaseqs consists only of barcode sequence (as in, there is no biological signal in the read). When cutadapt is demuxing, it strips away the barcode (which is good, we want this). Without the barcode, though, this particular read has nothing left, so its sequence line is blank. When QIIME 2 goes to save that file, it performs a bit of data validation, sees the record without any nts in it in this sample, and et voilà: the error you are seeing now has some context!
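
To make that mechanism concrete, here is a toy Python sketch. It is not how cutadapt is implemented; it only assumes an exact barcode match at the start of the read, and the function names are hypothetical. It shows how a barcode-only read turns into an empty record, and how a minimum-length check would discard it.

```python
def strip_leading_barcode(seq, qual, barcode):
    """Remove an exact barcode match from the start of a read.

    Returns the trimmed (seq, qual) pair, or the input unchanged if the
    read does not start with the barcode.
    """
    if seq.startswith(barcode):
        return seq[len(barcode):], qual[len(barcode):]
    return seq, qual


def keep_read(seq, min_length=1):
    """Keep only reads with at least min_length nucleotides."""
    return len(seq) >= min_length


barcode = "TACACACGTGAT"

# A read that is nothing but the barcode becomes an empty sequence.
empty_seq, empty_qual = strip_leading_barcode(barcode, "I" * 12, barcode)

# A read with biological sequence after the barcode keeps that sequence.
real_seq, real_qual = strip_leading_barcode(barcode + "ACGT", "I" * 16, barcode)
```

Here `keep_read(empty_seq)` is false and `keep_read(real_seq)` is true, so a length filter applied at trimming time would stop the empty record from ever being written.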

Okay, how to move forward...

I have an idea, but I am not sure if it will work.

1. Edit your RUN6_mapping.tsv file to remove the row that has barcode TACACACGTGAT.
2. Re-run qiime cutadapt demux-paired. Theoretically it will succeed, but if not, you might need to remove a few more samples in step 1. Let's play that by ear. Also, add the --verbose flag while running, just to get some more details.
3. From the logs in step 2, there will be a bunch of "Running external command line application. This may print messages to stdout and/or stderr" messages. Copy and paste one of those, edit the filenames to point to your reads in the directory PE/, and save the new reads to some new location. You will also need to update the command to use the barcode TACACACGTGAT. If you need help crafting this command please just ping us here, but please include the --verbose output from step 2. We will also need to modify the command with the following parameter: -m 1, which will only save reads that are 1 nt or longer. I think this command will look something like this:

cutadapt \
  --front TACACACGTGAT \
  --error-rate 0 \
  -o {name}.1.fastq.gz \
  -p {name}.2.fastq.gz \
  -m 1 \
  PE/forward.fastq.gz \
  PE/reverse.fastq.gz

4. Run qiime tools export on the results from step 2. Export to a new dir.
5. Copy and paste the results created from step 3 into the directory created in step 4.
6. Delete any files in the dir from step 4 that aren't fastq.gz files. These should be MANIFEST and metadata.yml.
7. Import the dir from step 4.
8. Get back to deblurring!

Yikes, that looks like a lot of steps, but I think it's mostly because I was a bit chatty in there. Let's give that a shot and see what we can do!

mbnalbright · July 25, 2018, 1:45pm

Thanks @thermokarst, I removed that barcode TACACACGTGAT from the mapping file and cutadapt worked!

I am a little bit confused about the second part of what you suggested and what it is supposed to do.

thermokarst · July 25, 2018, 2:06pm

Awesome! So, without that barcode, that means that the sample (97B_bacteriaseqs) is now stripped from your data. If that is acceptable, go ahead and proceed as usual, if not, we can talk about steps 3-8 above, which are intended to reconcile that issue (demux and integrate 97B_bacteriaseqs into the data). Thanks!

mbnalbright · July 25, 2018, 2:20pm

I am okay with losing that sample, so I will continue with the analysis. For future reference, if there was more than one barcode/file causing problems, in terms of troubleshooting, would all of the 'problem barcodes' come up in the initial error file or would it be an iterative process of removing one and then running again and removing the next one?
Thanks again for the help.

thermokarst · July 27, 2018, 12:30am

It would come up one at a time, so you would need to do the iterative approach.
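
One way to shorten that loop might be to scan the multiplexed reads before demultiplexing and list every barcode that has barcode-only reads. The Python sketch below does this under strong assumptions: strict four-line FASTQ records and reads that are exactly equal to a barcode. cutadapt's real matching is more permissive, so treat this as a rough pre-check only. The function name is hypothetical.

```python
from collections import Counter
from itertools import islice


def barcode_only_reads(lines, barcodes):
    """Count reads whose entire sequence is one of the given barcodes.

    `lines` is an iterable of FASTQ lines in strict four-line records.
    Returns a Counter mapping barcode -> number of barcode-only reads.
    """
    wanted = set(barcodes)
    hits = Counter()
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            # End of input (or a trailing partial record): stop scanning.
            break
        seq = record[1].strip()
        if seq in wanted:
            hits[seq] += 1
    return hits
```

In practice `lines` could come from `gzip.open(path, "rt")` on the multiplexed forward reads, and `barcodes` from the BarcodeSequence column of the mapping file. Any barcode with a nonzero count is a candidate for the kind of error seen above.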