KeyError: Identifier reported in taxonomic search results, but not reference taxonomy

steff1088 · January 16, 2018, 10:26pm

Two weeks ago I was working on the right configuration to classify samples with a high number of unassigned sequences (see previous). I thought I had found the right parameter settings, including the right combination of perc-identity in clustering and classification. Now, against my expectation that this problem was solved, a job that had been running for a couple of days stopped abruptly. I was able to grab the log file it created:

KeyError: 'Identifier AB001758.1.1756 was reported in taxonomic search results, but was not present in the reference taxonomy.'

Here is the command I used for open-reference clustering:

$ qiime vsearch cluster-features-open-reference \
  --i-sequences nonchimeric.qza \
  --i-table table_nonchimeric.qza \
  --i-reference-sequences 99_otus.qza \
  --p-perc-identity 0.97 \
  --p-threads 3 \
  --o-clustered-table seqs_97-99or-table.qza \
  --o-clustered-sequences 97-99or-seqs.qza \
  --o-new-reference-sequences 97-99or-newREFseqs.qza

And here is the one I used for Vsearch classification:

$ qiime feature-classifier classify-consensus-vsearch \
  --i-query seqs_filtered.qza \
  --i-reference-reads 99_otus.qza \
  --i-reference-taxonomy 99_majority_taxonomy_all_levels.qza \
  --p-perc-identity 0.94 \
  --p-min-consensus 0.6 \
  --p-maxaccepts 10 \
  --p-threads 5 \
  --o-classification 94-99-99taxonomy-vsearch-SILVA_128.qza

I have no clue why the job aborted since all reference files (seqs and taxonomy) are based on the same ID threshold. In between clustering and classification I only filtered singletons out.

Any ideas?

steffen

Nicholas_Bokulich · January 16, 2018, 11:43pm

Hi @steff1088,
My guess is that this error actually involves the way that you imported your reference taxonomy files.

The importer does not check whether the taxonomy file contains a header or not; the user needs to tell it whether to import a HeaderlessTSVTaxonomyFormat or TSVTaxonomyFormat. If TSVTaxonomyFormat is used but the file does not contain a header, then the first line is treated as the header, so that feature will be missing from the reference taxonomy even though it is present in the sequences file, leading to an error like this.

Could you check on how you imported this file, and whether the file actually contains a header?
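
A quick way to check for a header from the command line, as a rough sketch: it assumes the taxonomy is a plain tab-separated text file and that a header, if present, has "Feature ID" in its first column (my understanding of what TSVTaxonomyFormat expects). The function name is just for illustration.

```shell
# Sketch: guess whether a taxonomy TSV starts with a header line.
# Assumption: a header's first column is literally "Feature ID";
# any other first field is treated as a data row.
taxonomy_has_header() {
    local first_col
    # First tab-separated field of the first line.
    first_col=$(head -n 1 "$1" | cut -f 1)
    [ "$first_col" = "Feature ID" ]
}

# Usage (the path is a placeholder):
#   if taxonomy_has_header 99_majority_taxonomy_all_levels.txt; then
#       echo "header found"
#   else
#       echo "no header"
#   fi
```

If this reports no header, HeaderlessTSVTaxonomyFormat is the format to import with.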

steff1088 · January 17, 2018, 3:40pm

Hi @Nicholas_Bokulich,

I have done the filtering the old way with:

qiime feature-table filter-features
echo 'FeatureID','Frequency' | cat - sample_feature-frequency-detail.csv | tr "," "\\t" > sample_features-to-retain.tsv
qiime feature-table filter-seqs

I know that in the 2017.12 release this was consolidated into one command; can you point me to which one that is? I had problems with the header erasing my top feature before, and I am pretty sure I fixed it. At least it worked in test runs after the modification.

cheers,
steffen

Nicholas_Bokulich · January 17, 2018, 4:00pm

Hi @steff1088,
As of version 2017.12, feature-table filter-seqs accepts an optional table as input. When a feature table is included as input, the input seqs will be filtered to only include features present in the feature table.

You are correct, this could also be the cause of your KeyError. That error can be pretty cryptic... it will only be detected when the feature missing from the taxonomy file is used for a classification, so it might not be caught in any of your test runs (unless, e.g., you test that all sequences are classified when aligning against the same file).

Good luck! Let us know what you dig up!

steff1088 · January 18, 2018, 3:48pm

Hi @Nicholas_Bokulich,

A second run stopped with the same error, so I also checked the taxonomy import as you suggested. My import command:

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --source-format HeaderlessTSVTaxonomyFormat \
  --input-path 99_majority_taxonomy_all_levels.txt \
  --output-path 99_majority_taxonomy_all_levels.qza

and the head of the txt file looks like:

KF494428.1.1396	D_0__Bacteria;D_1__Proteobacteria;D_2__Epsilonproteobacteria;D_3__Campylobacterales;D_4__Helicobacteraceae;D_5__Sulfuricurvum;D_6__Sulfuricurvum sp. EW1;D_7__;D_8__;D_9__;D_10__;D_11__;D_12__;D_13__;D_14__
DQ841227.1.1285	D_0__Archaea;D_1__Woesearchaeota (DHVEG-6);D_2__uncultured archaeon;D_3__;D_4__;D_5__;D_6__;D_7__;D_8__;D_9__;D_10__;D_11__;D_12__;D_13__;D_14__
AF506248.1.1375	D_0__Bacteria;D_1__Cyanobacteria;D_2__Cyanobacteria;D_3__SubsectionIV;D_4__FamilyI;D_5__Nostoc;D_6__Nostoc sp. 'Nephroma expallidum cyanobiont 23';D_7__;D_8__;D_9__;D_10__;D_11__;D_12__;D_13__;D_14__
HQ774489.1.1456	D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Enterobacteriales;D_4__Enterobacteriaceae;D_5__Klebsiella;D_6__uncultured organism;D_7__;D_8__;D_9__;D_10__;D_11__;D_12__;D_13__;D_14__
JN507345.1.1370	D_0__Bacteria;D_1__Chloroflexi;D_2__Anaerolineae;D_3__Anaerolineales;D_4__Anaerolineaceae;D_5__uncultured;D_6__uncultured organism;D_7__;D_8__;D_9__;D_10__;D_11__;D_12__;D_13__;D_14__
JX218299.1.1475	D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__Ruminococcaceae;D_5__Ruminococcaceae V9D2013 group;D_6__uncultured rumen bacterium;D_7__;D_8__;D_9__;D_10__;D_11__;D_12__;D_13__;D_14__
DQ351908.1.1580	D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__Peptococcaceae;D_5__Thermincola;D_6__uncultured bacterium;D_7__;D_8__;D_9__;D_10__;D_11__;D_12__;D_13__;D_14__
AF282252.1.1505	D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Thermolithobacterales;D_4__Thermolithobacteraceae;D_5__Thermolithobacter;D_6__Thermolithobacter ferrireducens;D_7__;D_8__;D_9__;D_10__;D_11__;D_12__;D_13__;D_14__
EU

So the source format should be right, since there is no header. The error from the second run is consistent with the first error message; only the missing feature is a different one.

I have now filtered the data the newly introduced way, with just two commands and no workaround with a TSV file. I will keep trying to figure this out.

steffen

steff1088 · January 18, 2018, 3:59pm

Also, just to clarify: can we exclude any problems potentially caused by the combination of ID thresholds in clustering and classification? In both processes, the 99% ref seqs and taxonomy were used, but the data was clustered at 97% and classified at 94%. In my understanding that should be compatible, so this is just to check.

steffen

Nicholas_Bokulich · January 18, 2018, 4:08pm

Hi @steff1088,
Thanks for following up!

Yes, since the import command was correct, it sounds like maybe the issue is coming from that workaround TSV (potentially not being imported correctly, as you indicated). It might be worth checking if/where the missing features are in those TSVs to diagnose where this is occurring... though if the error goes away now that you are not using that workaround, maybe it is best to just leave it at that!

Yes. As long as the reference taxonomy and sequences are consistent, nothing else should matter. This KeyError concerns reference sequences that are not found in the reference taxonomy. The OTU clustering threshold in your query sequences (97%) and the percent identity threshold for finding a match during classification (94%) should have absolutely no effect (though theoretically the error could appear/disappear when you toggle perc-identity, just because the classifier is finding/excluding different matches based on this threshold and other parameters, giving the illusion that it is caused by these parameters... it is not).
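
If you want to find such identifiers before starting a long classification job, something like the following sketch could help. It assumes you have the reference sequences as a plain FASTA file and the taxonomy as a plain TSV (i.e. the files before import, not the .qza artifacts), that the sequence ID is the text after '>' up to the first whitespace, and that the taxonomy ID is in the first column. The function name is hypothetical.

```shell
# Sketch: print reference sequence IDs that have no row in the taxonomy file.
# Assumptions: plain-text inputs; FASTA ID = text after '>' up to the first
# whitespace; taxonomy ID = first tab-separated column.
missing_from_taxonomy() {
    local fasta="$1" taxonomy="$2" tmp_dir
    tmp_dir=$(mktemp -d) || return 1
    # Extract the ID from each FASTA header line, then sort and de-duplicate.
    sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$fasta" | LC_ALL=C sort -u > "$tmp_dir/seq_ids"
    # First column of the taxonomy file, sorted the same way.
    cut -f 1 "$taxonomy" | LC_ALL=C sort -u > "$tmp_dir/tax_ids"
    # comm -23 keeps only lines unique to the first file (the sequence IDs).
    LC_ALL=C comm -23 "$tmp_dir/seq_ids" "$tmp_dir/tax_ids"
    rm -rf "$tmp_dir"
}

# Usage (paths are placeholders):
#   missing_from_taxonomy reference_seqs.fasta reference_taxonomy.txt
```

Empty output means every reference sequence has a taxonomy entry; any ID printed here is one that could trigger the KeyError if the classifier hits it.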

Please let us know if using filter-seqs with a table input fixes your issue or not! Good luck!

steff1088 · February 13, 2018, 4:37pm

@Nicholas_Bokulich sorry I just realized I have not given you feedback on this yet.

I solved the problem. It was one of those "stupid" problems where you overlook something obvious, so don't judge me on this:

In the SILVA reference seqs and taxonomy folder for qiime2, there are reference files for all sequences and files restricted to 16S only. Whereas the otus.fasta files are distinguishable by their file name (otus_16S.fasta), the taxonomy files from the two groups are not: majority_taxonomy_all_levels.txt has the identical name in the 16S_only folder and the taxonomy_all folder. I combined reference seqs and taxonomy from the wrong folders, and so an identifier from the ref seqs showed up that was not found in the (more limited) taxonomy repertoire.

I hope nobody else makes that mistake. Take care which files you choose from the SILVA directory!
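
For anyone who wants a quick sanity check before importing, here is a minimal sketch that compares the number of FASTA records with the number of taxonomy rows. It assumes a headerless taxonomy file with one row per sequence; the function names are made up for illustration. Equal counts do not prove that the IDs match, but a mismatch is an immediate warning sign.

```shell
# Sketch: compare the number of reference sequences with the number of
# taxonomy rows. Assumption: the taxonomy file has no header line.

# Count FASTA records (lines starting with '>').
count_fasta_records() {
    grep -c '^>' "$1"
}

# Count non-empty rows in the taxonomy file.
count_taxonomy_rows() {
    awk 'NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Succeeds only if both counts are equal.
same_record_count() {
    [ "$(count_fasta_records "$1")" -eq "$(count_taxonomy_rows "$2")" ]
}

# Usage (paths are placeholders):
#   same_record_count otus.fasta majority_taxonomy_all_levels.txt \
#       || echo "record counts differ: check which folder each file came from"
```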

cheers,
steffen
