Identifier not present in reference taxonomy error while running classify-consensus-blast

ylor · June 25, 2018, 7:49pm

Hello,

I have generated my own custom fish reference database and am trying to use it with classify-consensus-blast. I am using the QIIME 2 Docker image 2018.2 (I haven't upgraded to 2018.6 yet).

Here is what my fasta file looks like:

Here is what my taxonomy file looks like (these taxonomy IDs don't necessarily match the accession numbers above):

Both of these files for the cytb gene have been imported as QIIME 2 artifacts with no error messages:

qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path ./cytb_GL_06252018.fasta \
--output-path cytb_GL_06252018.qza

qiime tools import \
--type 'FeatureData[Taxonomy]' \
--source-format HeaderlessTSVTaxonomyFormat \
--input-path ./cytb_GL_taxonomy_06252018.txt \
--output-path ./cytb_GL_taxonomy_06252018.qza

When I run classify-consensus-blast, I get the error message below:

qiime feature-classifier classify-consensus-blast \
--i-query ./rep-seqs-dada2-paired.qza \
--i-reference-reads ./cytb_GL_06252018.qza \
--i-reference-taxonomy ./cytb_GL_taxonomy_06252018.qza \
--p-evalue 1e-11 \
--p-strand plus \
--p-maxaccepts 1 \
--p-perc-identity 0.99 \
--output-dir ./consensus-blast \
--verbose

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: blastn -query /tmp/qiime2-archive-_il529t8/059f5b6b-635a-4e5f-a557-97265c53490e/data/dna-sequences.fasta -evalue 1e-11 -strand plus -outfmt 7 -subject /tmp/qiime2-archive-l9a6iprd/4395c55d-dc6c-4d84-8956-088eb77e6f3f/data/dna-sequences.fasta -perc_identity 99.0 -max_target_seqs 1 -out /tmp/tmp1rytl1ta

Plugin error from feature-classifier:

  'Identifier MF621736.1 was reported in taxonomic search results, but was not present in the reference taxonomy.'

What does this mean? I have checked both the fasta and taxonomy files and verified that the accession number in question is present in both files (highlighted terms in both figures above). This error has also occurred for at least 10 other accession numbers, all of which are present in both the fasta and taxonomy files. I did not have this issue with the previous database that I worked on.
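A check like this can also be done programmatically rather than by eye. The following is only a minimal sketch, assuming the fasta ID is the header text up to the first whitespace and the taxonomy file is headerless with the ID in the first tab-separated column. The function names and the example records are made up for illustration (only the accession MF621736.1 comes from the error message above).

```python
def fasta_ids(lines):
    """Yield sequence IDs: header text after '>' up to the first whitespace."""
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                yield parts[0]


def taxonomy_ids(lines):
    """Yield the first tab-separated field of each non-blank line, unstripped."""
    for line in lines:
        if line.strip():
            # Deliberately keep stray spaces so hidden mismatches stay visible.
            yield line.rstrip("\r\n").split("\t", 1)[0]


def compare_ids(fasta_lines, taxonomy_lines):
    """Return (IDs only in the fasta, IDs only in the taxonomy), both sorted."""
    seq_ids = set(fasta_ids(fasta_lines))
    tax_ids = set(taxonomy_ids(taxonomy_lines))
    return sorted(seq_ids - tax_ids), sorted(tax_ids - seq_ids)


# Hypothetical example data: the second taxonomy ID has a trailing space.
example_fasta = [">AB000001.1 placeholder description", "ACGT",
                 ">MF621736.1", "GGCC"]
example_taxonomy = ["AB000001.1\tplaceholder;lineage",
                    "MF621736.1 \tplaceholder;lineage"]

only_in_fasta, only_in_taxonomy = compare_ids(example_fasta, example_taxonomy)
print("Missing from taxonomy:", only_in_fasta)
print("Missing from fasta:", [repr(i) for i in only_in_taxonomy])
```

With real files, the two lists could be replaced by open file handles. Printing the IDs with repr() makes trailing spaces and similar invisible characters show up.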

Any suggestions and comments are welcome!

Thanks!

Nicholas_Bokulich · June 25, 2018, 8:15pm

"Custom" says it all: there is probably a minor formatting issue (e.g., incorrect line breaks) that is the culprit.

But don't worry — it is probably an easy fix.
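One way to look for that kind of formatting issue is to scan the taxonomy file for characters that are invisible in most editors. This is a rough sketch, not an official QIIME 2 tool: it assumes a headerless, two-column, tab-separated taxonomy file, and the function name and sample lines are hypothetical.

```python
def find_format_problems(lines):
    """Return (line_number, description) pairs for suspicious taxonomy lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        # Carriage returns point to Windows or classic Mac line endings.
        if line.endswith("\r\n") or line.endswith("\r"):
            problems.append((number, "carriage return in line ending"))
        body = line.rstrip("\r\n")
        if not body:
            continue  # blank line: nothing else to check
        fields = body.split("\t")
        if len(fields) != 2:
            problems.append(
                (number, f"expected 2 tab-separated fields, found {len(fields)}"))
        if fields[0] != fields[0].strip():
            problems.append((number, "whitespace around the ID"))
        if fields[0].startswith("\ufeff"):
            problems.append((number, "byte-order mark before the ID"))
    return problems


# Hypothetical sample: CRLF ending, leading space, and a space instead of a tab.
sample = ["A1\tx\r\n", " A2\ty\n", "A3 y\n", "A4\tz\n"]
for number, description in find_format_problems(sample):
    print(f"line {number}: {description}")
```

When reading a real file, opening it with open(path, newline="") keeps the original line endings, so carriage returns are not silently translated before the check runs.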

The quickest thing to do would probably be to just send me your database and I can check out the formatting; but I have a few follow-up questions below that could also help diagnose.

I am confused — you say below that you highlighted the matching accession #s but they obviously do not match in this example.

Only 1 error appears in your example. When/how are these other errors occurring? Are you running the database on different query files to get the different errors, or are you removing query and/or reference sequences and re-running to get these other errors?

Thanks!

ylor · June 27, 2018, 12:40pm

Thanks for getting back to me.

To generate these taxonomy strings, I took the accession numbers and plugged them into the taxonomizr R package. That R package gave me the taxonomy IDs and lineages. The taxonomy ID numbers are associated with the input accession numbers. I have re-run this and verified that these accession numbers match the taxonomy ID numbers and lineages.

The same error occurs for a different reference sequence each time I remove the previous reference sequence and taxonomy string that gave an error.

Nicholas_Bokulich · June 27, 2018, 2:41pm

Thanks for clarifying

This just isn't what is shown above; the reference taxonomy and fasta in QIIME 2 need to have matching IDs, but the fasta you showed still has NCBI accession #s.

I can't tell what exactly taxonomizr outputs. From a brief scan of the vignette, it looks like it may produce a SQL database mapping NCBI accession numbers to new taxonomy IDs. The fasta needs to be relabeled to contain the same IDs as the taxonomy file.

But perhaps I misunderstand what taxonomizr actually does. Can you either share your fasta and taxonomy files, or show me examples where the IDs actually match and cannot be found?

ylor · June 27, 2018, 2:51pm

I couldn't share the fasta file with you because it is too large.

I will re-label the taxonomy file so that it has the accession numbers and re-test that. Thanks.
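A relabeling step of this kind might look like the sketch below. It assumes two lookups are already available: accession number to taxonomy ID, and taxonomy ID to lineage string. How those are exported from taxonomizr is not shown here, and the function name and all example values other than the accession MF621736.1 are placeholders.

```python
def relabel_taxonomy(accession_to_taxid, taxid_to_lineage):
    """Build headerless TSV rows keyed by accession instead of taxonomy ID.

    Returns (rows, unmapped): the tab-separated rows, plus accessions whose
    taxonomy ID has no lineage and therefore could not be relabeled.
    """
    rows, unmapped = [], []
    for accession, taxid in accession_to_taxid.items():
        lineage = taxid_to_lineage.get(taxid)
        if lineage is None:
            unmapped.append(accession)
        else:
            rows.append(f"{accession}\t{lineage}")
    return rows, unmapped


# Placeholder lookups for illustration only.
example_accessions = {"MF621736.1": "1001", "XX000000.1": "9999"}
example_lineages = {"1001": "placeholder;lineage"}

rows, unmapped = relabel_taxonomy(example_accessions, example_lineages)
print("\n".join(rows))
print("No lineage found for:", unmapped)
```

The rows can then be written to a text file and imported as FeatureData[Taxonomy] in the same way as before, so that the first column matches the IDs in the fasta headers.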

system · July 28, 2018, 8:51pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.