Feature classifier consensus vsearch - key error

anan · June 28, 2018, 12:04pm

Hello there, fellow qiimers,

I've been running this vsearch command for a lot of my analyses using different databases. Lately I've been running into an issue with my BOLD database. When I run the consensus vsearch command with it, it keeps giving me this error:

KeyError: 'Identifier 342 was reported in taxonomic search results, but was not present in the reference taxonomy.'

I looked for that identifier in both my fasta and taxonomy files, but I can't find it. Both of my files are headerless, so I imported them using:

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --source-format HeaderlessTSVTaxonomyFormat \
  --input-path BOLDtaxa.txt \
  --output-path BOLDtaxa.qza

and

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path BODgenes.fasta \
  --output-path BOLDgenes.qza

But the error still persists. I would really appreciate it if you could help me with this issue.

Many thanks.
Anan

Nicholas_Bokulich · June 28, 2018, 12:20pm

This error indicates that a sequence identifier present in the fasta is not present in the taxonomy files.

Spaces or special characters in the header line could cause that line to be trimmed, resulting in this error. Look for 342 anywhere in that file.

What is the line count for each file? Do they have the same # of entries?
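One way to run the check described above is to compare the two sets of identifiers directly. The following is only a minimal sketch: the function names are hypothetical, and it assumes the ID is the text after `>` up to the first space or tab in the fasta, and the first tab-separated column in the headerless taxonomy file. It deliberately strips only `\n`, so any other trailing character on an ID stays visible.

```python
import re


def fasta_ids(lines):
    """Collect the ID from each FASTA header: text after '>' up to the first space or tab."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            # Strip only "\n" so a stray "\r" or other hidden character is kept.
            header = line[1:].rstrip("\n")
            ids.append(re.split(r"[ \t]", header, maxsplit=1)[0])
    return ids


def taxonomy_ids(lines):
    """Collect the first tab-separated column of each non-empty taxonomy row."""
    return [line.rstrip("\n").split("\t", 1)[0] for line in lines if line.strip()]


def compare_ids(seq_ids, tax_ids):
    """Return (IDs only in the fasta, IDs only in the taxonomy), each sorted."""
    seq_set, tax_set = set(seq_ids), set(tax_ids)
    return sorted(seq_set - tax_set), sorted(tax_set - seq_set)


def read_lines(path):
    """Read a text file without newline translation, so "\r\n" is not turned into "\n"."""
    with open(path, newline="") as handle:
        return handle.readlines()
```

A possible use would be `only_fasta, only_taxa = compare_ids(fasta_ids(read_lines("BODgenes.fasta")), taxonomy_ids(read_lines("BOLDtaxa.txt")))`, then printing each entry with `repr()` so that whitespace or control characters attached to an ID show up as escape sequences.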

anan · June 29, 2018, 12:22pm

I checked the number of identifiers in both files and they are similar. I also searched for 342 but couldn't find it anywhere in either the taxonomy file or the fasta.

Here is a copy of the error message I keep receiving. I would really appreciate it if you could offer me advice on how to proceed:

Command: vsearch --usearch_global /store2/anan/tmp/qiime2-archive-24c3b9v1/0ecfe9ea-a76d-4ba9-8185-b5630d4c1fad/data/dna-sequences.fasta --id 0.97 --strand both --maxaccepts 3 --maxrejects 0 --output_no_hits --db /store2/anan/tmp/qiime2-archive-j7aw57gj/4b0cf6df-445c-4121-aaef-0836a5e9a7df/data/dna-sequences.fasta --threads 20 --blast6out /store2/anan/tmp/tmpkcrzrl_l

vsearch v2.7.0_linux_x86_64, 503.7GB RAM, 80 cores

Reading file /store2/anan/tmp/qiime2-archive-j7aw57gj/4b0cf6df-445c-4121-aaef-0836a5e9a7df/data/dna-sequences.fasta 100%
937180679 nt in 1506031 seqs, min 55, max 2927, avg 622
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching query sequences: 229 of 10378 (2.21%)
Traceback (most recent call last):
File "/comp2/anan/Anaconda3/envs/qiime2-2018.4/lib/python3.5/site-packages/pandas/core/indexes/base.py", line 2566, in get_value
return libts.get_value_box(s, key)
File "pandas/_libs/tslib.pyx", line 1017, in pandas._libs.tslib.get_value_box
File "pandas/_libs/tslib.pyx", line 1025, in pandas._libs.tslib.get_value_box
TypeError: 'str' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/comp2/anan/Anaconda3/envs/qiime2-2018.4/lib/python3.5/site-packages/q2_feature_classifier/_consensus_assignment.py", line 104, in import_blast_format_assignments
t = ref_taxa[id]
File "/comp2/anan/Anaconda3/envs/qiime2-2018.4/lib/python3.5/site-packages/pandas/core/series.py", line 623, in __getitem__
result = self.index.get_value(self, key)
File "/comp2/anan/Anaconda3/envs/qiime2-2018.4/lib/python3.5/site-packages/pandas/core/indexes/base.py", line 2574, in get_value
raise e1
File "/comp2/anan/Anaconda3/envs/qiime2-2018.4/lib/python3.5/site-packages/pandas/core/indexes/base.py", line 2560, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas/_libs/index.pyx", line 83, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 91, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: '342'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/comp2/anan/Anaconda3/envs/qiime2-2018.4/lib/python3.5/site-packages/q2cli/commands.py", line 274, in __call__
results = action(**arguments)
File "", line 2, in classify_consensus_vsearch
File "/comp2/anan/Anaconda3/envs/qiime2-2018.4/lib/python3.5/site-packages/qiime2/sdk/action.py", line 231, in bound_callable
output_types, provenance)
File "/comp2/anan/Anaconda3/envs/qiime2-2018.4/lib/python3.5/site-packages/qiime2/sdk/action.py", line 366, in callable_executor
output_views = self._callable(**view_args)
File "/comp2/anan/Anaconda3/envs/qiime2-2018.4/lib/python3.5/site-packages/q2_feature_classifier/_vsearch.py", line 35, in classify_consensus_vsearch
unassignable_label=unassignable_label)
File "/comp2/anan/Anaconda3/envs/qiime2-2018.4/lib/python3.5/site-packages/q2_feature_classifier/_consensus_assignment.py", line 29, in _consensus_assignments
output.name, ref_taxa, unassignable_label=unassignable_label)
File "/comp2/anan/Anaconda3/envs/qiime2-2018.4/lib/python3.5/site-packages/q2_feature_classifier/_consensus_assignment.py", line 109, in import_blast_format_assignments
'taxonomy.').format(str(id)))
KeyError: 'Identifier 342 was reported in taxonomic search results, but was not present in the reference taxonomy.'

Nicholas_Bokulich · June 29, 2018, 2:47pm

Could you share these files?
BOLDtaxa.txt
BODgenes.fasta

You can send to me in a direct message if you don't want to post these publicly.

Or, if you cannot share them, at the very least please run these commands and post the outputs:

head BOLDtaxa.txt
head BODgenes.fasta
wc -l BOLDtaxa.txt
wc -l BODgenes.fasta

Thanks!
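A terminal will usually not display control characters in `head` output, so a file can look clean there and still contain them. As a rough complement to the commands above, here is a small sketch that shows the first lines as raw bytes and counts lines, fasta records, and Windows-style line endings. The function names are hypothetical, and the line count can differ by one from `wc -l` when the last line has no trailing newline.

```python
def raw_head(data: bytes, n: int = 10):
    """Return the first n lines as repr() strings so control characters are visible."""
    return [repr(line) for line in data.splitlines(keepends=True)[:n]]


def count_entries(data: bytes):
    """Count lines, FASTA header lines, and lines that end in CRLF."""
    lines = data.splitlines(keepends=True)
    return {
        "lines": len(lines),
        "fasta_records": sum(1 for line in lines if line.startswith(b">")),
        "crlf_lines": sum(1 for line in lines if line.endswith(b"\r\n")),
    }


def inspect_file(path, n=10):
    """Print the raw head and the counts for one file (reads the whole file into memory)."""
    with open(path, "rb") as handle:
        data = handle.read()
    for shown in raw_head(data, n):
        print(shown)
    print(count_entries(data))
```

Calling `inspect_file("BODgenes.fasta")` and `inspect_file("BOLDtaxa.txt")` would print both views for each file. For the taxonomy file the `fasta_records` value is not meaningful and can be ignored.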

anan · July 2, 2018, 2:09pm

Hi Nicholas,

I ran those commands, and as far as I can see, when compared to the templates in the Moving Pictures tutorial, they are similar and meet the requirements for reference database structure.

It's alright, I've attached those two files at these links:
https://www.dropbox.com/s/3ycq52pbsoce/taxa3UPDATED.txt?dl=
and
https://www.dropbox.com/s/f8y9imz5pdb2h/genes-new2.fasta?dl=

Many thanks!

Nicholas_Bokulich · July 2, 2018, 5:56pm

Thanks for sharing!

This one stumped me for a while because your files look fine to the naked eye, have the same number of entries, etc, and I don't have your query sequences to replicate the exact error you have.

But I believe I discovered the problem:

Your fasta file (but not your taxonomy file) contains invisible special characters (^M) at the end of the accession #s. This is the carriage return from Windows-style line endings. vsearch seems to be interpreting this character as part of the accession #, hence the mismatch and so much chaos. You can use something like dos2unix to convert your fasta file, and then everything should be okay.

Please give that a try and let me know if it works!
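If dos2unix is not available, the same conversion can be approximated in a few lines of Python. This is a minimal sketch, not a replacement for dos2unix: the function names and the output filename in the usage note are hypothetical, and it only rewrites `\r\n` pairs, leaving any lone `\r` untouched.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Replace Windows-style CRLF line endings with Unix-style LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(src, dst):
    """Stream src to dst with CRLF converted to LF; return how many lines were changed."""
    changed = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # Iterating a binary file yields chunks ending in b"\n", so each "\r\n" stays intact.
        for line in fin:
            fixed = crlf_to_lf(line)
            if fixed != line:
                changed += 1
            fout.write(fixed)
    return changed
```

For example, `convert_file("BODgenes.fasta", "BODgenes.unix.fasta")` would write a converted copy and report the number of lines it altered. The converted fasta then needs to be imported again with `qiime tools import` before rerunning the classifier.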

anan · July 4, 2018, 4:14pm

Hi Nicholas,

Yes, it finally worked. I really, really appreciate it.

Thanks a lot!!

Best
Anan
