Problem understanding how to create UNITE database - FormatIdentificationWarning: ...does not look like a fasta file

minardsmitha · May 15, 2019, 6:00pm

I am working with QIIME 2 (2019.1.0) on a shared high-performance cluster. My data are Illumina MiSeq paired-end FASTQ files.
I ran the steps below and now want to cluster my sequences. How do I use vsearch cluster-features-closed-reference correctly with the UNITE database?
I downloaded the UNITE database from https://plutof.ut.ee/#/doi/10.15156/BIO/786349, which I found on the QIIME 2 Data resources page. Is there something else I need to do with the UNITE database?

-Gunzip all fastq files.
-Discard empty files.
-Run cutadapt-paired on unzipped files to remove primers and discard sequences shorter than 75 bp.
-Gzip the cutadapt-trimmed files for use in QIIME 2.
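
For reference, the "discard empty files" step can be sketched roughly as follows. This is only a minimal illustration, not the exact commands I ran: the function name discard_empty_fastq and the directory argument are hypothetical, and it assumes the files have already been gunzipped and end in .fastq.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of QIIME 2 or cutadapt):
# delete zero-byte .fastq files in one directory and print what was removed.
discard_empty_fastq() {
    local dir="$1"
    # -maxdepth 1 keeps the search in the given directory only;
    # -size 0 matches files with no content; -delete removes them.
    find "$dir" -maxdepth 1 -type f -name '*.fastq' -size 0 -print -delete
}

# Example usage (path taken from the commands in this post):
# discard_empty_fastq /hpc/CBRDC
```

Printing the removed file names makes it easy to see which samples dropped out before import.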

#Create reference database
qiime tools import --type 'FeatureData[Sequence]' \
--input-path /hpc/databases/Qiime/UNITE/sh_refs_qiime_ver8_97_s_02.02.2019.fasta \
--input-format DNAFASTAFormat \
--output-path /hpc/databases/Qiime/UNITE/sh_refs_qiime_ver8_97_s_02.02.2019.qza

#Import FASTQ files
qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path /hpc/CBRDC \
--input-format CasavaOneEightSingleLanePerSampleDirFmt \
--output-path /hpc/CBRDC/demux-paired-end.qza

#Merge reads
qiime vsearch join-pairs --i-demultiplexed-seqs /hpc/CBRDC/demux-paired-end.qza \
--p-minovlen 1 --p-minmergelen 75 --p-maxns 1 \
--o-joined-sequences /hpc/CBRDC/joined_sequences.qza --verbose

#Quality filter reads
qiime quality-filter q-score-joined --i-demux /hpc/CBRDC/joined_sequences.qza \
--p-min-quality 20 --p-quality-window 75 --p-max-ambiguous 1 \
--o-filtered-sequences /hpc/CBRDC/joined_filt.qza \
--o-filter-stats /hpc/CBRDC/joined_filt_stats.qza --verbose

#Deblur
qiime deblur denoise-16S --i-demultiplexed-seqs /hpc/CBRDC/joined_filt.qza \
--p-trim-length 420 --p-sample-stats \
--o-table /hpc/CBRDC/table-deblur.qza \
--o-representative-sequences /hpc/CBRDC/rep-seqs-deblur.qza \
--o-stats /hpc/CBRDC/stats-deblur.qza --verbose

#Closed-reference clustering
qiime vsearch cluster-features-closed-reference \
--i-sequences /hpc/CBRDC/rep-seqs-deblur.qza \
--i-table /hpc/CBRDC/table-deblur.qza \
--i-reference-sequences /hpc/databases/Qiime/UNITE/sh_refs_qiime_ver8_97_s_02.02.2019.qza \
--o-unmatched-sequences /hpc/CBRDC/clustered_unmatched.qza \
--p-perc-identity .97 --p-threads 20 --verbose \
--o-clustered-table /hpc/CBRDC/clustered_table.qza \
--o-clustered-sequences /hpc/CBRDC/clustered_seqs.qza

#Error from qiime vsearch cluster-features-closed-reference

vsearch v2.7.0_linux_x86_64, 251.5GB RAM, 40 cores

Reading file /tmp/qiime2-archive-s9x9_r9q/cff10fb5-6628-4a19-9c65-e905d9e39c3c/data/dna-sequences.fasta 100%
34133887 nt in 62320 seqs, min 149, max 3526, avg 548
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%

Unable to read from file (/tmp/tmp7xkmlgjb)
/hpc/apps/qiime2-2019.1/install/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/qiime2-archive-1nbj7rb6/bd2bc7c9-c392-4cb9-82c1-1a1877942197/data/dna-sequences.fasta'> does not look like a fasta file
% (file, fmt), FormatIdentificationWarning)
Traceback (most recent call last):
File "/hpc/apps/qiime2-2019.1/install/lib/python3.6/site-packages/q2cli/commands.py", line 274, in __call__
results = action(**arguments)
File "</hpc/apps/qiime2-2019.1/install/lib/python3.6/site-packages/decorator.py:decorator-gen-122>", line 2, in cluster_features_closed_reference
File "/hpc/apps/qiime2-2019.1/install/lib/python3.6/site-packages/qiime2/sdk/action.py", line 231, in bound_callable
output_types, provenance)
File "/hpc/apps/qiime2-2019.1/install/lib/python3.6/site-packages/qiime2/sdk/action.py", line 365, in callable_executor
output_views = self._callable(**view_args)
File "/hpc/apps/qiime2-2019.1/install/lib/python3.6/site-packages/q2_vsearch/_cluster_features.py", line 256, in cluster_features_closed_reference
run_command(cmd)
File "/hpc/apps/qiime2-2019.1/install/lib/python3.6/site-packages/q2_vsearch/_cluster_features.py", line 33, in run_command
subprocess.run(cmd, check=True)
File "/hpc/apps/qiime2-2019.1/install/lib/python3.6/subprocess.py", line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['vsearch', '--usearch_global', '/tmp/tmp7xkmlgjb', '--id', '0.97', '--db', '/tmp/qiime2-archive-s9x9_r9q/cff10fb5-6628-4a19-9c65-e905d9e39c3c/data/dna-sequences.fasta', '--uc', '/tmp/tmpjygew9qe', '--strand', 'plus', '--qmask', 'none', '--notmatched', '/tmp/tmp9eljdk2o', '--threads', '20']' returned non-zero exit status 1.

Plugin error from vsearch:

Command '['vsearch', '--usearch_global', '/tmp/tmp7xkmlgjb', '--id', '0.97', '--db', '/tmp/qiime2-archive-s9x9_r9q/cff10fb5-6628-4a19-9c65-e905d9e39c3c/data/dna-sequences.fasta', '--uc', '/tmp/tmpjygew9qe', '--strand', 'plus', '--qmask', 'none', '--notmatched', '/tmp/tmp9eljdk2o', '--threads', '20']' returned non-zero exit status 1.

See above for debug info.
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --usearch_global /tmp/tmp7xkmlgjb --id 0.97 --db /tmp/qiime2-archive-s9x9_r9q/cff10fb5-6628-4a19-9c65-e905d9e39c3c/data/dna-sequences.fasta --uc /tmp/tmpjygew9qe --strand plus --qmask none --notmatched /tmp/tmp9eljdk2o --threads 20
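
The log shows vsearch reading the reference database without trouble (62320 seqs), while the file it could not read is the query passed to --usearch_global. A FormatIdentificationWarning like the one above can appear when a FASTA file contains no records, so one thing worth checking is whether the representative sequences are empty. A minimal sketch of that check, assuming the artifact has been exported first; the helper name count_fasta_records and the output directory exported-rep-seqs are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: count FASTA records by counting header lines.
count_fasta_records() {
    # grep -c prints the number of lines starting with '>'.
    # It exits non-zero when the count is 0, so '|| true' keeps
    # scripts running under 'set -e' from aborting here.
    grep -c '^>' "$1" || true
}

# Example usage (the export directory name is an assumption):
# qiime tools export --input-path /hpc/CBRDC/rep-seqs-deblur.qza \
#   --output-path exported-rep-seqs
# count_fasta_records exported-rep-seqs/dna-sequences.fasta
```

A count of 0 would suggest the problem lies upstream of the clustering step rather than in the UNITE import.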

minardsmitha · May 15, 2019, 6:35pm

I think I see one problem. I used qiime deblur denoise-16S, but I should have used qiime deblur denoise-other.