Taxonomy assignment with custom COI database - odd/wrong classifications?

elaine-shen · April 16, 2021, 9:38pm

Hello all,

I built a custom COI database from BOLD, following the instructions in the QIIME 2 docs; extracting the reads and training the classifier took a few days on my HPC. When I run feature-classifier classify-sklearn on my data (marine water samples), almost all of the sequences come back as unidentified arthropods or, oddly, birds. I am a bit stuck here: I tried this custom database to compare with Midori, and the same steps with Midori gave reasonable results, though they left something to be desired, which is why I was trying this other database.

For example, the first sequence below unambiguously matches Cephalopholis cyanostigma on BLAST, yet the classifier assigns it to Insecta. When I copy a portion of this sequence and search for it in the database FASTA file, I am able to locate Cephalopholis cyanostigma sequences, or at least reference sequences for the genus.

Feature ID	Taxon	Confidence
#q2:types	categorical	categorical
TAGCCGGCAACCTGGCTCATGCAGGCGCTTCCGTTGATTTAACAATCTTTTCACTACATTTAGCAGGTATTTCATCAATTCTAGGGGCAATCAACTTTATCACAACCATTATTAACATGAAACCTCCCGCCATCTCCCAATACCAAACACCCCTGTTTGTATGGGCTGTATTAATTACAGCTGTCCTTCTTCTTCTTTCCCTCCCCGTTCTCGCTGCAGGTATTACAATGCTTCTAACTGATCGAAACCTGAACACCACCTTCTTTGACCCAGCTGGTGGAGGAGACCCAATTCTTTATCAACACTTATTT	Eukaryota;Arthropoda;Insecta	0.896227957
TCTTAGTCACATTACAAGTCACTCAGGAGGGGCTGTAGACTTAGCAATTTTTAGCTTACACCTTTCAGGGGCTTCAAGCATTCTTGGAGCAATTAATTTTATTACCACAATTTTTAATATGCGTGGCCCTGGTTTAAGTATGCACAGACTCCCACTTTTTGTTTGGTCTGTTTTAATTACAGCTTTTTTATTACTTTTATCTCTTCCTGTTCTTGCAGGAGCTATTACAATGCTTTTAACGGACAGAAATTTTAATACTTCTTTTTTTGATCCAGCTGGAGGAGGTGATCCGATTTTATTTCAGCACCTTTTT	Eukaryota;Arthropoda;Insecta	0.999923614
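
The substring search described above can be sketched in Python. This is a minimal illustration, not part of QIIME 2: the helper names (`parse_fasta`, `find_fragment`) are made up for this example, and it assumes a plain-text FASTA file. It also checks the reverse complement of the fragment, so a hit reports which orientation the reference is stored in.

```python
from typing import Iterable, Iterator, List, Tuple

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_fragment(fragment: str, fasta_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (record_id, orientation) for each reference containing the fragment."""
    fwd = fragment.upper()
    rev = reverse_complement(fwd)
    hits = []
    for rec_id, seq in parse_fasta(fasta_lines):
        seq = seq.upper()
        if fwd in seq:
            hits.append((rec_id, "forward"))
        elif rev in seq:
            hits.append((rec_id, "reverse-complement"))
    return hits
```

With a real file this would be called as `find_fragment(fragment, open("reference.fasta"))`, where the file name is a placeholder. A hit only confirms that the reference is present and in which orientation; it says nothing about how the classifier weighs that reference against the rest of the database.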

Below are the commands I ran:

module load QIIME2/2019.7

qiime feature-classifier classify-sklearn \
--i-classifier crux_classifier.qza \
--i-reads combined-seqtab-rep-seqs.qza \
--o-classification combined-taxonomy-crux-v2.qza \
--verbose

qiime metadata tabulate \
--m-input-file combined-taxonomy-crux-v2.qza \
--o-visualization combined-taxonomy-crux-v2.qzv

The majority of my sequences are the expected length (313 bp):

Sequence Count	Min Length	Max Length	Mean Length	Range	Standard Deviation
41017	190	318	309.2	128	17.66

I observed similar behavior when I ran the RDP classifier through DADA2, by the way: AssignTaxonomy() using custom COI database yields Arthropods or NA's · Issue #1318 · benjjneb/dada2 · GitHub

Any suggestions for why this is happening? Scratching my head over here, and I don't think this has come up in a previous forum question?

SoilRotifer · April 18, 2021, 9:51pm

Hi @elaine-shen,

There could be a variety of reasons for this. The most common is inconsistent taxonomic annotations, which 'confuse' the classifier...
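
One way to look for inconsistent annotations is to list every taxon name that appears under more than one parent lineage. The sketch below is a rough illustration only: it assumes a headerless, tab-separated taxonomy file with semicolon-delimited ranks (the layout shown later in this thread), and `conflicting_parents` is a hypothetical helper name, not a QIIME 2 function.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def conflicting_parents(taxonomy_lines: Iterable[str]) -> Dict[Tuple[int, str], Set[str]]:
    """Map (rank_index, taxon_name) to its parent lineages when it has more than one."""
    parents = defaultdict(set)
    for line in taxonomy_lines:
        _seq_id, sep, lineage = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # skip blank or malformed lines that lack a tab
        ranks = [rank.strip() for rank in lineage.split(";")]
        # Record the lineage above each non-empty rank (rank 0 has no parent).
        for i in range(1, len(ranks)):
            if ranks[i]:
                parents[(i, ranks[i])].add(";".join(ranks[:i]))
    return {key: found for key, found in parents.items() if len(found) > 1}
```

Any name returned here needs a manual look: some are genuine curation errors, while others are legitimate homonyms that happen to share a name across distant groups.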

I would suggest trying out the CO1 reference databases from @devonorourke:
BOLD references:

The files are available here:

and the NCBI approach here:

There is another CO1 reference database maintained by the Porter Lab available here.

I think it would be a good idea to compare these reference databases to your existing one.

-Cheers!
-Mike

elaine-shen · April 19, 2021, 1:21pm

I'll give these databases a shot - thanks! @devonorourke and I have certainly crossed paths on the forum - thanks for both of y'all's hard work!

For completeness, here are the first few lines of the fasta and taxonomy files, in case there are formatting issues I did not catch (though I suspect this isn't the problem, as I was able to run qiime feature-classifier fit-classifier-naive-bayes with no issues):

>LACM:DISCO:7833
TTTGTCTAGAAACCTAGCTCATATAGGTGGGTCTGTAGATTTAGCTATTTTTTCTCTTCATTTAGCAGGGGCTTCGTCAATTTTAGGTGCGGTAAATTTTATTACTACCGTAACTAACATGCGATGGGCAGGGATGCAATGAGAGCGCCTTACTTTATTTACTTGGTCTGTAAAAATTACTGCTGTTTTGCTTCTTTTGTCTCTTCCAGTTTTAGCCGGTGCAATTACAATATTACTAACGGACCGTAATTTTAATACTGCCTTTTTTGACCCTGCGGGAGGGGGGGACCCCGTACTATACCAGCATCTGTTT
>LACM:DISCO:7831
CCTATCATCAGGTATTGCTCACGGGGGGGCTTCAGTAGATTTAGCTATTTTTAGATTACATTTAGCGGGAATCTCATCAATTTTAGGGGCTGTGAATTTCATTACTACAATTATTAATATACGATCTGTTGGAATAACTTTTGATCGAATACCATTATTTGTGTGATCAGTAGGAATTACAGCACTATTATTACTTTTATCTYTACCTGTATTAGCGGGAGCTATTACAATATTATTAACTGATCGAAATTTAAATACTTCATTTTTTGATCCGGCGGGAGGGGGAGACCCTATTCTCTATCAACATTTATTT

LACM:DISCO:5659	Eukaryota;Arthropoda;Branchiopoda;Anostraca;Branchinectidae;Branchinecta;Branchinecta lindahli
LACM:DISCO:5661	Eukaryota;Arthropoda;Branchiopoda;Anostraca;Branchinectidae;Branchinecta;Branchinecta lindahli
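
The excerpts above show different IDs in the FASTA and taxonomy files, which by itself only means the two files are ordered differently. A quick cross-check of the full files could look like the sketch below. It assumes a plain FASTA file and a headerless tab-separated taxonomy file, and the function names are hypothetical, chosen just for this example.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def fasta_ids(fasta_lines: Iterable[str]) -> Set[str]:
    """Collect record IDs (first token after '>') from FASTA header lines."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def taxonomy_table(taxonomy_lines: Iterable[str]) -> Dict[str, str]:
    """Read 'ID<TAB>lineage' rows into a dict, skipping lines without a tab."""
    table = {}
    for line in taxonomy_lines:
        seq_id, sep, lineage = line.rstrip("\n").partition("\t")
        if sep:
            table[seq_id.strip()] = lineage.strip()
    return table


def compare(fasta_lines: Iterable[str], taxonomy_lines: Iterable[str]) -> dict:
    """Report IDs present in only one file and how many ranks each lineage has."""
    ids = fasta_ids(fasta_lines)
    tax = taxonomy_table(taxonomy_lines)
    depths = Counter(len(lineage.split(";")) for lineage in tax.values())
    return {
        "missing_taxonomy": sorted(ids - set(tax)),
        "missing_sequence": sorted(set(tax) - ids),
        "rank_depths": dict(depths),
    }
```

If both "missing" lists are empty and `rank_depths` has a single key, the two files at least agree on IDs and on the number of ranks per lineage. That rules out a formatting mismatch but not misannotated references.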

SoilRotifer · April 19, 2021, 1:44pm

Hi @elaine-shen,

Yeah, I do not think it is a formatting issue. There are often 'baked-in' misannotations or improper curation in the respective databases where this information is downloaded. For example, there are many bad references with ambiguous bases, sequences that are too short, etc... Some of these issues are referenced here:
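
As a rough sketch of screening for the kinds of bad references mentioned above, the Python below flags sequences that are too short or contain ambiguous bases. The thresholds (`min_length=250`, `max_ambiguous=0`) are arbitrary assumptions for illustration, not recommended values, and the function names are hypothetical. Note that the second FASTA record quoted earlier in this thread contains a `Y`, so it would be flagged under these settings.

```python
from typing import Dict, List, Tuple


def reference_problems(seq: str, min_length: int = 250, max_ambiguous: int = 0) -> List[str]:
    """List the reasons a reference sequence looks suspect (empty list if none)."""
    problems = []
    if len(seq) < min_length:
        problems.append("too short")
    # Anything other than A/C/G/T (IUPAC codes, N, gaps) counts as ambiguous.
    n_ambiguous = sum(1 for base in seq.upper() if base not in "ACGT")
    if n_ambiguous > max_ambiguous:
        problems.append(f"{n_ambiguous} ambiguous base(s)")
    return problems


def filter_references(
    records: Dict[str, str], min_length: int = 250, max_ambiguous: int = 0
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Split records into kept sequences and rejected IDs with their reasons."""
    kept, rejected = {}, {}
    for rec_id, seq in records.items():
        problems = reference_problems(seq, min_length, max_ambiguous)
        if problems:
            rejected[rec_id] = problems
        else:
            kept[rec_id] = seq
    return kept, rejected
```

Keeping the rejection reasons, rather than silently dropping records, makes it easier to see how much of the database is affected before retraining a classifier on the filtered set.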
