Two weeks ago I was working on the right configuration to classify samples with a high number of unassigned sequences (see previous). I thought I had found the right parameter settings, including the right combination of perc-identity in clustering and classification. Now, against my expectation that this problem was solved, a job that had been running for a couple of days stopped abruptly. I was able to grab the log file it created:
KeyError: 'Identifier AB001758.1.1756 was reported in taxonomic search results, but was not present in the reference taxonomy.'
Here is the command I used for open-reference clustering:
I have no clue why the job aborted since all reference files (seqs and taxonomy) are based on the same ID threshold. In between clustering and classification I only filtered singletons out.
Hi @steff1088,
My guess is that this error actually involves the way that you imported your reference taxonomy files.
The importer does not check whether the taxonomy file contains a header or not; the user needs to tell it whether to import a HeaderlessTSVTaxonomyFormat or a TSVTaxonomyFormat. If TSVTaxonomyFormat is used but the file does not contain a header, the first row is read as the header, so that feature will be missing from the imported reference taxonomy even though it is present in the sequences file, leading to an error like this.
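One quick way to check for a header before importing is to look at the first line of the file. Here is a minimal sketch in plain Python (not part of QIIME 2). The function name and the set of header labels are my own assumptions, so adjust them to whatever your file actually uses; the taxonomy strings in the example are placeholders.

```python
import io

# Assumed first-column header labels (hypothetical list; edit to match your file).
HEADER_LABELS = {"feature id", "#otu id", "#otuid"}

def first_line_is_header(handle, header_labels=HEADER_LABELS):
    """Return True if the first non-empty line looks like a header row."""
    for line in handle:
        if line.strip():
            # Compare only the first tab-separated field, case-insensitively.
            first_field = line.rstrip("\n").split("\t")[0].strip().lower()
            return first_field in header_labels
    return False  # empty file: nothing that could be a header

# In-memory stand-ins for a real taxonomy file (placeholder taxonomy strings).
headerless = io.StringIO("AB001758.1.1756\tplaceholder;taxonomy\n")
with_header = io.StringIO("Feature ID\tTaxon\nAB001758.1.1756\tplaceholder;taxonomy\n")

print(first_line_is_header(headerless))
print(first_line_is_header(with_header))
```

With a real file you would pass open("your_taxonomy.txt") instead of the StringIO objects. If this returns False, the headerless format is probably the one to import with.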
Could you check on how you imported this file, and whether the file actually contains a header?
I know that in release 2017.12 this has been consolidated into one command; can you point me to which one that is? I had problems with the header erasing my top feature before, and I am pretty sure I fixed it. At least it worked in test runs after the modification.
Hi @steff1088,
As of version 2017.12, feature-table filter-seqs accepts an optional table as input. When a feature table is included as input, the input seqs will be filtered to only include features present in the feature table.
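Conceptually, that filtering just keeps the sequences whose IDs appear as features in the table. The sketch below illustrates the idea in plain Python; it is not the QIIME 2 implementation, and the function name, the dict standing in for the sequences, and the example IDs are all made up for illustration.

```python
def filter_seqs_by_table(seqs, table_feature_ids):
    """Keep only sequences whose ID is a feature in the table.

    seqs: dict mapping sequence ID -> sequence string (stand-in for the seqs artifact)
    table_feature_ids: iterable of feature IDs (stand-in for the feature table)
    """
    keep = set(table_feature_ids)
    return {seq_id: seq for seq_id, seq in seqs.items() if seq_id in keep}

# Hypothetical data: the table no longer contains the filtered-out singleton.
seqs = {"otu1": "ACGT", "otu2": "GGCC", "singleton1": "TTAA"}
table_feature_ids = ["otu1", "otu2"]

filtered = filter_seqs_by_table(seqs, table_feature_ids)
print(sorted(filtered))  # ['otu1', 'otu2']
```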
You are correct, this could also be the cause of your KeyError. That error can be pretty cryptic… it will only be detected when the feature missing from the taxonomy file is used for a classification, so it might not be caught in any of your test runs (unless, e.g., you test that all sequences are classified when aligning against the same file).
So the source format should be right since there is no header. The error received from the second run is consistent with the first error message; only the missing feature is a different one.
I have filtered the data now using the newly introduced way with just two commands and no work-around with a tsv file. I keep trying to figure this out…
Also, just to clarify: Can we exclude any problems potentially caused by the combination of ID thresholds in clustering and classification? In both processes, the 99% ref seqs and taxonomy were used, but the data were clustered at 97% and classified at 94%. In my understanding that should be compatible, so this is just to check.
Yes, since the import command was correct, it sounds like maybe the issue is coming from that workaround TSV (potentially not being imported correctly, as you indicated). It might be worth checking if/where the missing features are in those TSVs to diagnose where this is occurring... though if the error goes away now that you are not using that workaround, maybe it is best to just leave it at that!
Yes. As long as the reference taxonomy and sequences are consistent, nothing else should matter. This KeyError concerns reference sequences that are not found in the reference taxonomy. The OTU clustering threshold in your query sequences (97%) and the percent identity threshold for finding a match during classification (94%) should have absolutely no effect (though theoretically the error could appear/disappear when you toggle perc-identity, just because the classifier is finding/excluding different matches based on this threshold and other parameters, giving the illusion that it is caused by these parameters... it is not).
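If you want to check that consistency directly, you can compare the IDs in the reference FASTA against the IDs in the taxonomy file before running the classifier. This is a rough sketch in plain Python, assuming a standard FASTA file and a tab-separated taxonomy file with the ID in the first column; the function names are hypothetical, and apart from AB001758.1.1756 (taken from the error above) the example IDs are made up.

```python
import io

def fasta_ids(handle):
    """Collect sequence IDs (first token after '>') from a FASTA file."""
    return {
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and line[1:].strip()
    }

def taxonomy_ids(handle, has_header=False):
    """Collect IDs from the first column of a tab-separated taxonomy file."""
    ids = set()
    for i, line in enumerate(handle):
        if has_header and i == 0:
            continue  # skip the header row
        line = line.strip()
        if line:
            ids.add(line.split("\t")[0])
    return ids

def missing_from_taxonomy(fasta_handle, taxonomy_handle, has_header=False):
    """Return reference sequence IDs that have no taxonomy entry."""
    return fasta_ids(fasta_handle) - taxonomy_ids(taxonomy_handle, has_header)

# In-memory stand-ins for the real files (REF_A and the taxonomy string are made up).
ref_seqs = io.StringIO(">REF_A\nACGT\n>AB001758.1.1756\nGGCC\n")
ref_tax = io.StringIO("REF_A\tplaceholder;taxonomy\n")

print(missing_from_taxonomy(ref_seqs, ref_tax))  # {'AB001758.1.1756'}
```

An empty set means every reference sequence has a taxonomy entry; anything it prints is an ID that could trigger this KeyError if it turns up as a search hit.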
Please let us know if using filter-seqs with a table input fixes your issue or not! Good luck!
@Nicholas_Bokulich sorry I just realized I have not given you feedback on this yet.
I solved the problem… it was one of those “stupid” problems where you overlook something obvious, so don't judge me on this:
In the SILVA reference seqs and taxonomy folder for qiime2, there are reference files for all sequences and reference files restricted to 16S only. Now, whereas the otus.fasta files are distinguishable by their file name (otus_16S.fasta), the taxonomy files from the two groups are not! The majority_taxonomy_all_levels.txt file has the identical name in the 16S_only folder and the taxonomy_all folder. I combined reference seqs and taxonomy from mismatched groups, and so an identifier from the ref seqs showed up that was not found in the (more limited) taxonomy.
I hope nobody else makes that mistake. Take care which files you choose from the SILVA directory!