Can't get deblur to finish running with large (8 GB) .qza file

Steven_Mamet · March 16, 2019, 11:32am

Hello, I've been trying to run our large dataset of 16S rRNA sequences through qiime2-2019.1 on our server without success. I can run subsets of the data but not the full dataset.

Here is the command I've been trying to run:

qiime deblur denoise-16S
--i-demultiplexed-seqs reads_qza/reads_trimmed_filt.qza
--p-trim-length 190
--p-jobs-to-start 60
--o-representative-sequences reads_qza/rep-seqs-deblur.qza
--o-table table-deblur.qza
--p-sample-stats
--o-stats deblur-stats.qza

And here is the error I've been consistently getting:

Plugin error from deblur:

Command '['deblur', 'workflow', '--seqs-fp', '/u1/sdm231/canola/qiime2-tmp/qiime2-archive-a8skcrmf/bf98b765-79fb-4553-8471-e86382f67349/data', '--output-dir', '/u1/sdm231/canola/qiime2-tmp/tmp3ej_o739', '--mean-error', '0.005', '--indel-prob', '0.01', '--indel-max', '3', '--trim-length', '190', '--min-reads', '10', '--min-size', '2', '--jobs-to-start', '60', '-w', '--keep-tmp-files']' returned non-zero exit status 1.

Debug info has been saved to /u1/sdm231/canola/qiime2-tmp/qiime2-q2cli-err-s8qx2i6n.log

Here is a link to the deblur log file:
https://www.dropbox.com/s/ietijvo0pzmaljz/deblur.log?dl=0

I feel like the error might be related to not enough room in my partition on the server to accommodate the tmp files generated during processing, but I'm open to suggestions as this has been plaguing me for weeks. I was able to run a this dataset through qiime2-2018.11 no problem but just can't get it to run with 2019.1.

I've tried:

Changing temporary directory (e.g., export TMPDIR="$PWD/qiime2-tmp/" )
Removing --p-sample-stats
Changing --p-trim-length
Changing --p-jobs-to-start (our server has 64 available)

Any insights?

Thanks for any help you can provide!

Steve

jwdebelius · March 16, 2019, 11:42am

Hi @Steven_Mamet,

I don’t have a genius solution directly, but I might be able to make some recomendafions. First, I’d clean out your tmp folder every time. I’d also check with your server admin to see if they have a system scratch folder as their tmp. (Although every set up is different, this is often a folder that gets emptied quickly and may have better read/write than other areas.)

Next, if it’s not processing pooled, I’d just process subsets. It’s lazy, manual parallelization and again, not a deeper solution to your problem, but might let you test the memory issue and may just get the table deblurred. I think that deblur is probably more robust for parallelization than Dada2 because it doesn’t infer the error model from the data or chimeras.

Again, not clever but maybe practical.

Best,
Justine

colinbrislawn · March 17, 2019, 6:07pm

Hello Steven,

I'm totally support Justine's suggestion. "Deblur operates on a per-sample level" (cite), so you can simply process all your subsets, then merge them all together into a single table.

And I think this might be the official way to solve your problem!

Colin

Steven_Mamet · March 17, 2019, 7:05pm

Hi all,

I had hoped to avoid processing subsets (I found I could process 500 fastqs at a time, so this will mean running the pipeline 10 times with 5000+ fastq files). It's especially odd since the pipeline worked on a comparable data set with 2018.11, but not with 2019.1.

I appreciate the input—I don't think manual parallelization is a permanent solution to the problem, though it will allow me to complete this particular task.

If I do find a solution to the problem I've encountered, I'll be sure to share in case it's of use.

Cheers,

Steve

Steven_Mamet · March 18, 2019, 2:32pm

So I think changing the temporary directory solved any storage issues. But there is an issue with two of my fastq files. If I remove these, the subset pipeline will run, but I can't seem to find the issue within the files.

Everything will run up until the deblur step. Here is the output from the tmp log:

Traceback (most recent call last):
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/bin/deblur", line 684, in
deblur_cmds()
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/click/core.py", line 764, in call
return self.main(*args, **kwargs)
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/bin/deblur", line 632, in workflow
threads_per_sample=threads_per_sample)
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/deblur/workflow.py", line 833, in launch_workflow
left_trim_len=left_trim_length):
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/deblur/workflow.py", line 130, in trim_seqs
for label, seq in input_seqs:
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/deblur/workflow.py", line 99, in sequence_generator
for record in skbio.read(input_fp, format=format, **kw):
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/skbio/io/registry.py", line 506, in
return (x for x in itertools.chain([next(gen)], gen))
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/skbio/io/registry.py", line 531, in _read_gen
yield from reader(file, **kwargs)
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/skbio/io/registry.py", line 1008, in wrapped_reader
yield from reader_function(fhs[-1], **kwargs)
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/skbio/io/format/fastq.py", line 344, in _fastq_to_generator
seq, qual_header = _parse_sequence_data(fh, seq_header)
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/skbio/io/format/fastq.py", line 481, in _parse_sequence_data
_blank_error("before '+'")
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/skbio/io/format/fastq.py", line 473, in _blank_error
raise FASTQFormatError(error_string)
skbio.io._exception.FASTQFormatError: Found blank or whitespace-only line before '+' in FASTQ file
Traceback (most recent call last):
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/q2cli/commands.py", line 274, in call
results = action(**arguments)
File "</home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/decorator.py:decorator-gen-432>", line 2, in denoise_16S
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/sdk/action.py", line 231, in bound_callable
output_types, provenance)
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/sdk/action.py", line 365, in callable_executor
output_views = self._callable(**view_args)
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_deblur/_denoise.py", line 96, in denoise_16S
hashed_feature_ids=hashed_feature_ids)
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_deblur/_denoise.py", line 163, in denoise_helper
subprocess.run(cmd, check=True)
File "/home/sdm231/miniconda3/envs/qiime2-2019.1/lib/python3.6/subprocess.py", line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['deblur', 'workflow', '--seqs-fp', '/tmp/qiime2-archive-lrbcucl/6c54ea4a-f0b8-494f-b213-6d43c30aa83f/data', '--output-dir', '/tmp/tmp_bnyu_ns', '--mean-error', '0.005', '--indel-prob', '0.01', '--indel-max', '3', '--trim-length', '250', '--min-reads', '10', '--min-size', '2', '--jobs-to-start', '1', '-w', '--keep-tmp-files']' returned non-zero exit status 1.

Which seems consistent with this:

But these are straight from the sequencer and I'm not sure how to troubleshoot this. I could proceed without these sampled, but I'm curious if there is anything I can do to fix this. I've attached the forward-reverse reads for one sample here if the qiime2 brain collective has any advice:

ORIG2CR1156NW05000L_R1.fastq.gz (1.1 MB)
ORIG2CR1156NW05000L_R2.fastq.gz (1.3 MB)

Thanks for your help!

Steve

thermokarst · March 19, 2019, 2:50pm

Can you run qiime tools validate on this file? That might point out any issues with that sample pair.

wasade · March 20, 2019, 1:07am

Hi @Steven_Mamet,

Based on the exception, it appears that one of the sequence files is either corrupt or empty:

My guess is that it is most likely a file that is empty. Can you verify that all of the fastq.gz files contain sequence data? The error output in the log file suggests some files have very few sequences ("ORIG2CR1117NW06000L" reports as having 3, some as 1 or 2, etc).

Thanks,
Daniel

Steven_Mamet · March 20, 2019, 3:52pm

Hi Matt and Daniel,

I ran qiime tools validate on the reads_trimmed_filt.qza file and it output this message:
Result reads_trimmed_filt.qza does not appear to be valid at level=max:

/tmp/qiime2-archive-c9sunmwx/3e9bfb93-796f-4b7a-a55a-ff88860e01c2/data/ORIG2CR1231NW08000L_1769_L001_R1_001.fastq.gz is not a(n) FastqGzFormat file:

Missing sequence for record beginning on line 33681

When I open up that particular file in Text Wrangler, I can't spot anything obvious on line 33681. I've uploaded the pair here to see if you guys can spot something I'm missing:

https://www.dropbox.com/s/uujw1gt7jwa5cpa/ORIG2CR1231NW08000L_R1.fastq.gz?dl=0
https://www.dropbox.com/s/cpckxlfk83md6mj/ORIG2CR1231NW08000L_R2.fastq.gz?dl=0

I'll remove this file from the pipeline and proceed and see if it works.

Thanks for all of your help!

Steve

thermokarst · March 20, 2019, 10:23pm

Hi @Steven_Mamet --- I just imported and validated the files for that sample on my machine, everything looks good. It looks like you pulled these data from your pre-import files, but maybe it makes more sense to export reads_trimmed_filt.qza and look at ORIG2CR1231NW08000L_1769_L001_R1_001.fastq.gz within that file...

Also, how did you import your sequences? Is it possible you swapped IDs on accident somewhere?

Steven_Mamet · March 20, 2019, 10:49pm

Hi @thermokarst, thanks for checking those. Just to be thorough, here are the steps I took prior to deblurring:

parallel --link --jobs 50 'cutadapt
--pair-filter any
--no-indels
--discard-untrimmed
-g CTACGGGGGGCAGCAG
-G GGACTACCGGGGTATCT
-o primer_trimmed_fastqs/{1/}
-p primer_trimmed_fastqs/{2/}
{1} {2}
> primer_trimmed_fastqs/{1/}_cutadapt_log.txt' ::: raw_data/_R1.fastq.gz ::: raw_data/_R2.fastq.gz

qiime tools import --type SampleData[PairedEndSequencesWithQuality]
--input-path manifest.csv
--output-path reads_qza/reads_trimmed.qza
--input-format PairedEndFastqManifestPhred33

qiime quality-filter q-score
--i-demux reads_qza/reads_trimmed.qza
--o-filtered-sequences reads_qza/reads_trimmed_filt.qza
--o-filter-stats filt_stats.qza

And finally deblur:

qiime deblur denoise-16S
--i-demultiplexed-seqs reads_qza/reads_trimmed_filt.qza
--p-trim-length 200
--o-representative-sequences reads_qza/rep-seqs-deblur.qza
--o-table reads_qza/table-deblur.qza
--p-sample-stats
--o-stats deblur-stats.qza

I had been running the validate step and sequentially removing the files from the primer trimmed fastqs and then re-running, and it keeps telling outputting errors about files. I can't find anything wrong with the fastqs and so I'm baffled as to why there are errors. Maybe it's the way I've run cutadapt (i.e., cutadapt versus q2-cutadapt)?

Steven_Mamet · March 21, 2019, 5:12pm

Hi @thermokarst

Okay, there was some problem with cutadapt that was giving the missing sequence error. Here's what I tried to solve the issue:

Use q3-cutadapt:
qiime tools import --type SampleData[PairedEndSequencesWithQuality]
--input-path raw_data
--output-path reads_qza/reads.qza
--input-format CasavaOneEightSingleLanePerSampleDirFmt

qiime cutadapt trim-paired --i-demultiplexed-sequences reads_qza/reads.qza
--p-cores 60
--p-adapter-f CTACGGGGGGCAGCAG
--p-adapter-r GGACTACCGGGGTATCT
--output-dir primer_trimmed_fastqs/

I was still getting the same error with this approach.

cutadapt 2.1 versus 1.18 and adding --minimum-length 1:
parallel --link --jobs 50 'cutadapt
--pair-filter any
--no-indels
--discard-untrimmed
-g CTACGGGGGGCAGCAG
-G GGACTACCGGGGTATCT
--minimum-length 1
-o primer_trimmed_fastqs/{1/}
-p primer_trimmed_fastqs/{2/}
{1} {2} \

primer_trimmed_fastqs/{1/}_cutadapt_log.txt' ::: raw_data/_R1.fastq.gz ::: raw_data/_R2.fastq.gz

I realized the cutadapt installed on the server was old and so updated to 2.1 and added the minimum length step. This worked and the qiime validate step said the reads.qza file was good! So I was able to finish the rest of the pipeline.

So I think without the minimum length restriction, there was whitespace or some erroneous data ending up in the trimmed files that would cause problems later on in the pipeline.

Now just to figure out how to merge this table with a previous run. I've seen some other threads on here about that so hopefully you won't hear from me for a while

Thanks,

Steve

thermokarst · March 21, 2019, 5:25pm

Ah ha! That is totally it! We have seen this crop up a few times before, and have put this on our todo list to expose in q2-cutadapt. Looks like we also need to expose the --pair-filter flag. Thanks for updating us!

I believe in you - you can do it!

:qiime2:

Steven_Mamet · March 22, 2019, 7:35pm

Hi @thermokarst and @wasade,

So after ironing all of the hurdles I've encountered running through the pipeline for our project, I've put together a pipeline based mostly on the qiime2 tutorials, but also on idiosyncrasies with the server I was running things on. I've put it together for the graduate students working on this project, but thought other people might be interested too:

https://github.com/StevenMamet/qiime2-P2IRC

Apologies if it looks weird, it's my first GitHub post.

Thanks again for your help.

Cheers,

Steve

Nicholas_Bokulich · March 22, 2019, 7:52pm

Thanks a ton for sharing, @Steven_Mamet! At the risk of stating the obvious, I will say that posting workflows like this to github is not only useful for students and fellow forum users, but can be really useful for reporting results in publication, and you can even post links to raw data (e.g., deposited in qiita or SRA or EBI) so that readers can recapitulate your results.

Steven_Mamet · March 22, 2019, 8:04pm

Thanks @Nicholas_Bokulich! Once our analyses are finalized, we will post the qza files because we love the provenance aspect—which has really come in handy as we're revisiting past analyses (so we know what we did with which files ;). Plus, we can be fully transparent with our workflows!

system · April 23, 2019, 2:04am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.