Finding incorrectly formatted needle in apparently mostly formatted properly haystack

Good morning QIIME folks,
The problem
This question is about a barplot in which a taxonomic identifier is labeled incorrectly, and about figuring out why. The incorrect label is rare, and I can’t find the label shown in the barplot in any of the files I’ve used.

The process
I’ve constructed a custom COI database with records from both the Barcode of Life Database (BOLD) and select GenBank records. Following the tutorials and posts in this forum, I’ve imported a pair of .qza artifacts representing the sequences and taxonomy strings needed to run the various feature-classifier options, which assign taxonomic identities (and confidences) to my sequences. The importing process was as follows:

qiime tools import --type 'FeatureData[Sequence]' --input-path my.qiimeCOI.fa --output-path ref_seqs.qza

qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat --input-path my.qiimeCOI.txt --output-path ref_taxonomy.qza

In addition, I imported an OTU table (.biom format) and a dereplicated, denoised, clustered fasta file containing my representative sequences:

qiime tools import --input-path my.OTUtable.biom --type 'FeatureTable[Frequency]' --output-path my_otutable.qza

qiime tools import --input-path my.unoise.cluster.otus.fa --type 'FeatureData[Sequence]' --output-path my_seq.qza

I then assigned taxonomic information with VSEARCH:

qiime feature-classifier classify-consensus-vsearch --i-query my_seq.qza --i-reference-reads ref_seqs.qza --i-reference-taxonomy ref_taxonomy.qza --p-maxaccepts 500 --p-perc-identity 0.7 --p-strand both --p-threads 4 --o-classification my_tax_Vsearch.qza

Finally, I created the barplot:

qiime taxa barplot --i-table my_otutable.qza --i-taxonomy my_tax_Vsearch.qza --m-metadata-file my-metadata.txt --o-visualization my_barplot.qzv

When I looked at the resulting my_barplot.qzv file in view.qiime2.org, I noticed four categories assigned at Level 1: “k__animalia” (expected!), “Unassigned” (expected!), and two additional ones:

  • “k__GAATTGGGACAGCCAGGCGCTCTTTTGGGGGACGATCAGATTTATAACGTGATTGTAACTGCTCATGCGT”
  • “k__GAATTAGGCCAACCAGGGGCCCTACTCGGAGATGATCAGATTTATAATGTAATTGTCACCGCTCATGCAT”

If I drill down all the way to Level 7, I get hundreds of potential hits (nice!), yet those two unexpected labels remain. That is, the odd sequence-like labels appear twice, and only twice, at every taxonomic level.
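
To see which features carry those labels, one thing I could try is reading the taxonomy table straight out of the classification artifact. This is only a rough sketch: it assumes the .qza is a zip archive with the table stored as data/taxonomy.tsv, and the function name, the regular expression, and the 20-character minimum are all my own choices for illustration.

```python
import csv
import io
import re
import zipfile

# Matches a rank such as "k__GAATTGGG..." whose value is only nucleotide letters.
# The 20-character minimum is an arbitrary threshold chosen for this sketch.
DNA_RANK = re.compile(r"^[a-z]__[ACGTN]{20,}$", re.IGNORECASE)


def dna_like_rows(qza_path):
    """Return (feature_id, taxon) pairs whose taxonomy string has a DNA-like rank."""
    hits = []
    with zipfile.ZipFile(qza_path) as archive:
        # Assumption: the payload lives under <uuid>/data/taxonomy.tsv.
        names = [n for n in archive.namelist() if n.endswith("data/taxonomy.tsv")]
        for name in names:
            text = archive.read(name).decode("utf-8")
            for row in csv.reader(io.StringIO(text), delimiter="\t"):
                if len(row) < 2:
                    continue  # skip blank or malformed lines
                ranks = [rank.strip() for rank in row[1].split(";")]
                if any(DNA_RANK.match(rank) for rank in ranks):
                    hits.append((row[0], row[1]))
    return hits


# Example call (file name taken from the classification step above):
# for feature_id, taxon in dna_like_rows("my_tax_Vsearch.qza"):
#     print(feature_id, taxon)
```

The same check could be pointed at ref_taxonomy.qza to see whether the odd labels are already present in the imported reference taxonomy.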

I’ve tried looking back into the original .txt and .fa files I used for the import but can’t find those sequences anywhere. How else might folks suggest debugging this to work out why those two labels are popping up?
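
One reason a plain text search might miss these strings is that FASTA sequences can be wrapped over several lines, or the match might sit on the opposite strand. Here is a minimal sketch of a search that allows for both, assuming a standard FASTA layout; the helper names are hypothetical and I have not confirmed that either situation applies to my files.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining sequences wrapped over several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sequence(lines, needle):
    """Return (header, strand) for each record containing needle on either strand."""
    needle = needle.upper()
    rc_needle = reverse_complement(needle)
    hits = []
    for header, seq in read_fasta(lines):
        seq = seq.upper()  # make the comparison case-insensitive
        if needle in seq:
            hits.append((header, "forward"))
        elif rc_needle in seq:
            hits.append((header, "reverse-complement"))
    return hits


# Example call (file name taken from the import step above):
# with open("my.qiimeCOI.fa") as handle:
#     print(find_sequence(handle, "GAATTGGGACAGCCAGGCGCTCTTTTGGGGGACGATCAGATTTATAACGTGATTGTAACTGCTCATGCGT"))
```

Running the same search against my.unoise.cluster.otus.fa would show whether the strings come from the query sequences rather than from the reference.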

Many thanks
