I have generated my own custom fish reference database and am trying to use it with classify-consensus-blast. I am using the QIIME 2 Docker image 2018.2 (I haven't upgraded to 2018.6 yet).
When I run classify-consensus-blast, I get the error message below:
qiime feature-classifier classify-consensus-blast \
--i-query ./rep-seqs-dada2-paired.qza \
--i-reference-reads ./cytb_GL_06252018.qza \
--i-reference-taxonomy ./cytb_GL_taxonomy_06252018.qza \
--p-evalue 1e-11 \
--p-strand plus \
--p-maxaccepts 1 \
--p-perc-identity 0.99 \
--output-dir ./consensus-blast \
--verbose
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.
Command: blastn -query /tmp/qiime2-archive-_il529t8/059f5b6b-635a-4e5f-a557-97265c53490e/data/dna-sequences.fasta -evalue 1e-11 -strand plus -outfmt 7 -subject /tmp/qiime2-archive-l9a6iprd/4395c55d-dc6c-4d84-8956-088eb77e6f3f/data/dna-sequences.fasta -perc_identity 99.0 -max_target_seqs 1 -out /tmp/tmp1rytl1ta
Plugin error from feature-classifier:
'Identifier MF621736.1 was reported in taxonomic search results, but was not present in the reference taxonomy.'
What does this mean? I have checked both the fasta and taxonomy files and verified that the accession number in question is present in both (highlighted terms in both figures above). This error has also occurred for at least 10 other accession numbers, all of which are present in both the fasta and taxonomy files. I did not have this issue with the previous database that I worked on.
Custom says it all: there is probably a minor formatting issue (e.g., incorrect line breaks) that is the culprit.
But don't worry, it is probably an easy fix.
The quickest thing to do would probably be to just send me your database and I can check out the formatting; but I have a few follow-up questions below that could also help diagnose.
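If you want to check the formatting yourself first, here is a minimal sketch of the kind of scan I have in mind. It assumes the taxonomy file is a headerless, two-column, tab-separated file (ID, then taxonomy string). The function name and the sample lines are made up for illustration; only MF621736.1 comes from your error message.

```python
def find_formatting_problems(lines):
    """Return (line_number, problem) tuples for suspicious taxonomy lines.

    Assumes each line should look like: <ID><TAB><taxonomy string>
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        body = line.rstrip("\n")
        # A leftover carriage return means Windows-style line endings.
        if body.endswith("\r"):
            problems.append((number, "carriage return at end of line"))
            body = body.rstrip("\r")
        if not body.strip():
            problems.append((number, "blank line"))
            continue
        fields = body.split("\t")
        if len(fields) != 2:
            problems.append(
                (number, f"expected 2 tab-separated fields, found {len(fields)}")
            )
            continue
        # Leading or trailing spaces make an ID a different string.
        if fields[0] != fields[0].strip():
            problems.append((number, "whitespace around the ID"))
    return problems


# Hypothetical sample lines; the taxonomy strings are placeholders.
example = [
    "MF621736.1\tk__Animalia; p__Chordata\n",
    "id_b\tk__Animalia; p__Chordata\r\n",
    "id_c k__Animalia; p__Chordata\n",
    " id_d\tk__Animalia; p__Chordata\n",
]

for number, problem in find_formatting_problems(example):
    print(f"line {number}: {problem}")
```

To run this on a real file, read the lines with open(path, newline="") so that Python does not silently normalize the line endings before the check sees them.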
I am confused: you say below that you highlighted the matching accession #s but they obviously do not match in this example.
Only 1 error appears in your example. When/how are these other errors occurring? Are you running the database on different query files to get the different errors, or are you removing query and/or reference sequences and re-running to get these other errors?
To generate these taxonomy strings, I took the accession numbers and plugged them into the taxonomizr R package. That R package gave me the taxonomy IDs and lineage. The taxonomy ID numbers are associated with the input accession numbers. I have re-run this and verified that these accession numbers match the taxonomy ID numbers and lineages.
The same error is occurring for different reference queries each time I remove the previous reference sequence and taxonomy string that gave an error.
This just isn't what is shown above; the reference taxonomy and fasta in QIIME 2 need to have matching IDs, but the fasta you showed still has NCBI accession #s.
I can't tell what exactly taxonomizr outputs. From a brief scan of the vignette, it looks like maybe it produces a SQL database mapping NCBI accession numbers to new taxonomy IDs? The fasta needs to be relabeled to contain the same IDs as the taxonomy file.
But perhaps I misunderstand what taxonomizr actually does. Can you either share your fasta and taxonomy files, or show me examples where the IDs actually match and cannot be found?
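In the meantime, a quick way to see which IDs disagree is to compare the two ID sets directly. This is only a rough sketch: it assumes the sequence ID is the first whitespace-delimited token of each fasta header and that the taxonomy file is tab-separated with the ID in the first column. The function names and the sample data are hypothetical; only MF621736.1 is taken from the error message.

```python
def fasta_ids(lines):
    """Collect the first whitespace-delimited token of every '>' header."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def taxonomy_ids(lines):
    """Collect the first tab-separated field of every non-blank line.

    The ID is deliberately not stripped, so an ID that differs only by
    a stray space shows up as a mismatch.
    """
    ids = set()
    for line in lines:
        body = line.rstrip("\r\n")
        if body.strip():
            ids.add(body.split("\t")[0])
    return ids


def compare_ids(fasta_lines, taxonomy_lines):
    """Return (IDs missing from taxonomy, IDs missing from fasta)."""
    seqs = fasta_ids(fasta_lines)
    taxa = taxonomy_ids(taxonomy_lines)
    return sorted(seqs - taxa), sorted(taxa - seqs)


# Hypothetical sample data for illustration.
fasta = [
    ">MF621736.1 example description\n",
    "ACGT\n",
    ">only_in_fasta\n",
    "ACGT\n",
]
taxonomy = [
    "MF621736.1\tk__Animalia\n",
    "only_in_taxonomy\tk__Animalia\n",
]

missing_from_taxonomy, missing_from_fasta = compare_ids(fasta, taxonomy)
print("In fasta but not in taxonomy:", missing_from_taxonomy)
print("In taxonomy but not in fasta:", missing_from_fasta)
```

If both lists come back empty for your real files and the error still appears, then the problem is somewhere other than a simple ID mismatch, and seeing the files would help.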