Feature-classifier says an identifier was not present in the reference taxonomy

Erin_dfo · March 14, 2018, 9:04pm

Hi again,

I am new to Qiime2 and working on matching my OTUs to my reference sequences. I am working with 12S sequence data from fish, and I don’t think there’s a database like Greengenes for that, so I had to create my own custom reference sequences and reference taxonomy.

Importing the reference database and taxonomy went fine. Here are the commands I used:

qiime tools import --input-path nrdatabase_20180109_qiime2.fasta --output-path referenceseqs.qza --type 'FeatureData[Sequence]'

qiime tools import --type FeatureData[Taxonomy] --source-format TSVTaxonomyFormat --input-path qiime2taxonomy2.txt --output-path referencetaxonomy.qza

I then tried to match my OTUs to my reference sequences using this command:

qiime feature-classifier classify-consensus-blast --i-query Teleo_OTUs_97_sequence.qza --i-reference-reads referenceseqs.qza --i-reference-taxonomy referencetaxonomy.qza --p-maxaccepts 10 --p-perc-identity 0.9 --o-classification Teleo_OTUs_97_classifications --verbose

And this is the output I got:

Command: blastn -query /tmp/qiime2-archive-f7b9g7dh/9190d430-896d-4c8a-b296-3101a3bdf254/data/dna-sequences.fasta -evalue 0.001 -strand both -outfmt 7 -subject /tmp/qiime2-archive-kygwgb7f/c79a9c41-5b87-46ec-93bc-189fe66065e2/data/dna-sequences.fasta -perc_identity 90.0 -max_target_seqs 10 -out /tmp/tmptqv7y665

Plugin error from feature-classifier:

'Identifier NC_020760 was reported in taxonomic search results, but was not present in the reference taxonomy.'

Because this is a custom database for only our species of interest, the files are quite small, and I was able to look through them manually and confirm that this identifier is present in both the reference sequences and reference taxonomy. The labels match exactly (NC_020760 Coregonus_nasus), and there are other sequences with underscores and spaces in the names that don’t seem to be causing problems, so I am not sure what I’ve done wrong.

Any ideas how I can make this work?

Thanks!

Erin

Erin_dfo · March 14, 2018, 9:10pm

And in case there is a formatting issue I haven't spotted, here are snippets from the reference taxonomy and sequences:

Erin_dfo · March 15, 2018, 4:27pm

Update: I reformatted the reference sequence and taxonomy files so that there were no spaces or underscores in the identifiers, and it ran fine. This is a little inconvenient for downstream analyses, so I’d still be interested in hearing if there is another fix.

Nicholas_Bokulich · March 15, 2018, 5:07pm

The taxonomy format that we use follows the QIIME 1 taxonomy format, in which the identifier should contain only the accession number, not any space-delimited taxonomic information (e.g., the species name). After all, the species information is already provided in the taxonomy string, so it is redundant in the sequence ID.
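As an illustration of that rule, here is a minimal Python sketch that trims FASTA headers down to the accession before import. It relies on the usual FASTA convention that the identifier is the text before the first whitespace, which is consistent with the error above reporting `NC_020760` rather than the full `NC_020760 Coregonus_nasus` label. The function names `accession_only` and `strip_fasta_descriptions` are hypothetical helpers, not part of QIIME 2.

```python
def accession_only(header: str) -> str:
    """Return the identifier part of a FASTA header.

    Assumption: the accession is everything before the first whitespace,
    e.g. '>NC_020760 Coregonus_nasus' -> 'NC_020760'.
    """
    parts = header.lstrip(">").split(None, 1)
    return parts[0] if parts else ""


def strip_fasta_descriptions(lines):
    """Yield FASTA lines with each header reduced to '>accession'.

    Sequence lines are passed through unchanged (minus the trailing newline).
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield ">" + accession_only(line)
        else:
            yield line


if __name__ == "__main__":
    # Made-up two-line record, for illustration only.
    example = [">NC_020760 Coregonus_nasus\n", "ACGT\n"]
    for cleaned in strip_fasta_descriptions(example):
        print(cleaned)
```

The same `accession_only` helper could be applied to the first column of the taxonomy file so that both files end up with identical identifiers.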

While conventions like this are inconvenient, they allow us to ensure that many different downstream analyses that work with taxonomy metadata — and behave in different ways (often because QIIME2 is wrapping external tools that are developed by others) — will work consistently.

I think the best solution for this is to enforce this rule when data are imported, so that a clear error message appears explaining why the data failed to import. I have raised this issue to get this fixed in a future release of QIIME2.
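In the meantime, a check along these lines can be run by hand before importing. This is only a sketch of the idea, not how QIIME 2 validates its formats. It assumes a headerless, tab-separated taxonomy file with the identifier in the first column, and the function names are hypothetical.

```python
def fasta_ids(lines):
    """Return the full header text (without '>') of every FASTA record."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def taxonomy_ids(lines):
    """Return the first tab-separated column of every non-empty line.

    Assumption: headerless TSV, identifier in column one.
    """
    return [line.split("\t", 1)[0].strip() for line in lines if line.strip()]


def check_reference_ids(fasta_lines, taxonomy_lines):
    """Return a list of human-readable problems; an empty list means OK."""
    seq_ids = fasta_ids(fasta_lines)
    tax_ids = taxonomy_ids(taxonomy_lines)
    tax_id_set = set(tax_ids)
    problems = []
    for seq_id in seq_ids:
        if any(ch.isspace() for ch in seq_id):
            problems.append(f"sequence ID {seq_id!r} contains whitespace")
        if seq_id not in tax_id_set:
            problems.append(f"sequence ID {seq_id!r} is missing from the taxonomy")
    for tax_id in tax_ids:
        if any(ch.isspace() for ch in tax_id):
            problems.append(f"taxonomy ID {tax_id!r} contains whitespace")
    return problems


if __name__ == "__main__":
    # Made-up records, for illustration only.
    fasta = [">NC_020760 Coregonus_nasus\n", "ACGT\n"]
    taxonomy = ["NC_020760 Coregonus_nasus\tCoregonus nasus\n"]
    for problem in check_reference_ids(fasta, taxonomy):
        print(problem)
```

Note that in the example the labels match exactly in both files, yet the check still flags them, because the whitespace inside the identifier is the problem rather than a mismatch between the files.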

Thanks!
