Taxonomy assignment with custom COI database - odd/wrong classifications?

Hello all,

I am using a custom COI database built from BOLD. Extracting the reads and training the classifier took a few days on my HPC, following the instructions in the QIIME 2 docs. When I run feature-classifier classify-sklearn on my data (marine water samples), almost all of the sequences come back as unidentified arthropods or, oddly, birds. I am a bit stuck here. I tried this custom database to compare it with Midori; the same steps with Midori gave reasonable results, but they left something to be desired, which is why I was trying this other database.

For example, the first sequence below unambiguously matches Cephalopholis cyanostigma on BLAST, yet it is assigned as an insect. When I copy a portion of this sequence and search for it in the database FASTA file, I am able to locate Cephalopholis cyanostigma sequences, or at least reference sequences for the genus.

Feature ID Taxon Confidence
#q2:types categorical categorical
TAGCCGGCAACCTGGCTCATGCAGGCGCTTCCGTTGATTTAACAATCTTTTCACTACATTTAGCAGGTATTTCATCAATTCTAGGGGCAATCAACTTTATCACAACCATTATTAACATGAAACCTCCCGCCATCTCCCAATACCAAACACCCCTGTTTGTATGGGCTGTATTAATTACAGCTGTCCTTCTTCTTCTTTCCCTCCCCGTTCTCGCTGCAGGTATTACAATGCTTCTAACTGATCGAAACCTGAACACCACCTTCTTTGACCCAGCTGGTGGAGGAGACCCAATTCTTTATCAACACTTATTT Eukaryota;Arthropoda;Insecta 0.896227957
TCTTAGTCACATTACAAGTCACTCAGGAGGGGCTGTAGACTTAGCAATTTTTAGCTTACACCTTTCAGGGGCTTCAAGCATTCTTGGAGCAATTAATTTTATTACCACAATTTTTAATATGCGTGGCCCTGGTTTAAGTATGCACAGACTCCCACTTTTTGTTTGGTCTGTTTTAATTACAGCTTTTTTATTACTTTTATCTCTTCCTGTTCTTGCAGGAGCTATTACAATGCTTTTAACGGACAGAAATTTTAATACTTCTTTTTTTGATCCAGCTGGAGGAGGTGATCCGATTTTATTTCAGCACCTTTTT Eukaryota;Arthropoda;Insecta 0.999923614

Below are the commands I ran:

module load QIIME2/2019.7

qiime feature-classifier classify-sklearn \
--i-classifier crux_classifier.qza \
--i-reads combined-seqtab-rep-seqs.qza \
--o-classification combined-taxonomy-crux-v2.qza \
--verbose

qiime metadata tabulate \
--m-input-file combined-taxonomy-crux-v2.qza \
--o-visualization combined-taxonomy-crux-v2.qzv

The majority of my sequences are the expected length (313 bp):

Sequence Count Min Length Max Length Mean Length Range Standard Deviation
41017 190 318 309.2 128 17.66

By the way, I observed similar behavior when I ran the RDP classifier in DADA2: AssignTaxonomy() using custom COI database yields Arthropods or NA's · Issue #1318 · benjjneb/dada2 · GitHub

Any suggestions for why this is happening? I am scratching my head over here, and I don’t think this has come up in a previous forum question.

Hi @elaine-shen,

There could be a variety of reasons for this. Most common is inconsistent taxonomic annotations which ‘confuse’ the classifier…
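
One rough way to look for this kind of inconsistency is to check whether the same taxon name appears under more than one parent lineage in the taxonomy file. The sketch below is only an illustration: the function name conflicting_lineages is hypothetical, and it assumes each line is a sequence ID, whitespace, then a semicolon-delimited lineage, as in the taxonomy excerpt in this thread. Some hits may be legitimate homonyms rather than errors, so treat the output as a list of things to inspect by hand.

```python
from collections import defaultdict


def conflicting_lineages(taxonomy_lines):
    """Return {taxon name: sorted parent lineages} for names seen under >1 parent.

    Hypothetical helper: assumes 'ID<whitespace>rank1;rank2;...' per line.
    """
    parents = defaultdict(set)
    for line in taxonomy_lines:
        parts = line.rstrip("\n").split(None, 1)  # ID, then the full lineage
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        ranks = [rank.strip() for rank in parts[1].split(";")]
        for depth, name in enumerate(ranks):
            if name:
                # Record everything above this rank as the name's parent path.
                parents[name].add(";".join(ranks[:depth]))
    return {name: sorted(paths) for name, paths in parents.items() if len(paths) > 1}
```

Passing an open file handle for the taxonomy file (or any list of lines) returns only the names whose parent lineage is not unique.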

I would suggest trying out the CO1 reference databases from @devonorourke:
BOLD references:

The files are available here:

and the NCBI approach here:

There is another CO1 reference database maintained by the Porter Lab available here.

I think it would be good to compare these reference databases to your existing one.

-Cheers!
-Mike


I’ll give these databases a shot - thanks! @devonorourke and I have certainly crossed paths on the forum - thanks for both of y’all’s hard work!

For completeness, here are the first few lines of the fasta and taxonomy files, in case there are formatting issues I did not catch (though I suspect this isn’t the problem, as I was able to run qiime feature-classifier fit-classifier-naive-bayes with no issues):

>LACM:DISCO:7833
TTTGTCTAGAAACCTAGCTCATATAGGTGGGTCTGTAGATTTAGCTATTTTTTCTCTTCATTTAGCAGGGGCTTCGTCAATTTTAGGTGCGGTAAATTTTATTACTACCGTAACTAACATGCGATGGGCAGGGATGCAATGAGAGCGCCTTACTTTATTTACTTGGTCTGTAAAAATTACTGCTGTTTTGCTTCTTTTGTCTCTTCCAGTTTTAGCCGGTGCAATTACAATATTACTAACGGACCGTAATTTTAATACTGCCTTTTTTGACCCTGCGGGAGGGGGGGACCCCGTACTATACCAGCATCTGTTT
>LACM:DISCO:7831
CCTATCATCAGGTATTGCTCACGGGGGGGCTTCAGTAGATTTAGCTATTTTTAGATTACATTTAGCGGGAATCTCATCAATTTTAGGGGCTGTGAATTTCATTACTACAATTATTAATATACGATCTGTTGGAATAACTTTTGATCGAATACCATTATTTGTGTGATCAGTAGGAATTACAGCACTATTATTACTTTTATCTYTACCTGTATTAGCGGGAGCTATTACAATATTATTAACTGATCGAAATTTAAATACTTCATTTTTTGATCCGGCGGGAGGGGGAGACCCTATTCTCTATCAACATTTATTT
LACM:DISCO:5659 Eukaryota;Arthropoda;Branchiopoda;Anostraca;Branchinectidae;Branchinecta;Branchinecta lindahli
LACM:DISCO:5661 Eukaryota;Arthropoda;Branchiopoda;Anostraca;Branchinectidae;Branchinecta;Branchinecta lindahli
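
A quick consistency check on the two files could look something like the Python sketch below. It is only a sketch under assumptions: the helper names (fasta_ids, taxonomy_table, check_reference) are made up for illustration, and it assumes the taxonomy file has one sequence ID per line followed by whitespace and a semicolon-delimited lineage. It reports IDs present in one file but not the other, plus how many lineages have each number of ranks. The excerpts above show different IDs in the two files, which may simply reflect different sort orders, so a whole-file comparison is more informative than the first few lines.

```python
from collections import Counter


def fasta_ids(fasta_lines):
    """Yield the first whitespace-delimited token of each FASTA header."""
    for line in fasta_lines:
        if line.startswith(">"):
            yield line[1:].split()[0]


def taxonomy_table(taxonomy_lines):
    """Parse 'ID<whitespace>rank1;rank2;...' lines into {ID: [ranks]}."""
    table = {}
    for line in taxonomy_lines:
        parts = line.rstrip("\n").split(None, 1)
        if len(parts) == 2:
            table[parts[0]] = [rank.strip() for rank in parts[1].split(";")]
    return table


def check_reference(fasta_lines, taxonomy_lines):
    """Compare IDs between the two files and tally lineage depths."""
    seq_ids = set(fasta_ids(fasta_lines))
    taxonomy = taxonomy_table(taxonomy_lines)
    return {
        "missing_in_taxonomy": sorted(seq_ids - set(taxonomy)),
        "missing_in_fasta": sorted(set(taxonomy) - seq_ids),
        # Number of lineages at each depth, e.g. {7: 1200, 3: 15}
        "rank_depths": dict(Counter(len(ranks) for ranks in taxonomy.values())),
    }
```

Calling check_reference with open handles for the FASTA and taxonomy files returns a small report; a non-empty "missing" list or a wide spread of rank depths would be worth a closer look before retraining.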

Hi @elaine-shen,

Yeah, I do not think it is a formatting issue. There are often ‘baked-in’ misannotations or improper curation in the databases from which this information is downloaded. For example, there are many bad references with ambiguous bases, sequences that are too short, etc. Some of this is referenced here:
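
To make the screening idea concrete, here is a minimal Python sketch that drops reference sequences that are too short or contain ambiguous bases. The function names and the thresholds (min_length=250, max_ambiguous=0) are assumptions chosen purely for illustration, not recommended values. Note that the LACM:DISCO:7831 record quoted earlier in this thread contains a Y, so it would be flagged by a zero-ambiguity filter.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def screen_references(records, min_length=250, max_ambiguous=0):
    """Split records into kept and dropped lists.

    min_length and max_ambiguous are hypothetical thresholds for illustration.
    Any character other than A, C, G, or T counts as ambiguous.
    """
    kept, dropped = [], []
    for header, seq in records:
        n_ambiguous = sum(1 for base in seq.upper() if base not in "ACGT")
        if len(seq) < min_length:
            dropped.append((header, "too short"))
        elif n_ambiguous > max_ambiguous:
            dropped.append((header, "ambiguous bases"))
        else:
            kept.append((header, seq))
    return kept, dropped
```

The kept records would still need a matching taxonomy file (with the dropped IDs removed) before retraining the classifier.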
