I am running a custom COI database built from BOLD. Extracting the reads and training the classifier took a few days on my HPC, following the instructions in the QIIME 2 docs. When I run feature-classifier classify-sklearn on my data (marine water samples), almost all of the sequences come back as unidentified arthropods or, oddly, birds. I am a bit stuck here. I built this custom database to compare against Midori; the same steps with Midori seemed to give reasonable results, though they left something to be desired, which is why I was trying this other database.
For example, the first sequence here unambiguously matches Cephalopholis cyanostigma on BLAST, yet it is being assigned as an insect. When I copy a portion of this sequence and search for it in the database FASTA file, I am able to locate Cephalopholis cyanostigma sequences, or at least reference sequences for the genus.
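A minimal sketch of that kind of lookup in Python, assuming the reference database is a plain FASTA file. The function names are hypothetical helpers, not part of QIIME 2. The reverse-complement check is an extra that was not part of the manual search described above; it only reports which strand the fragment was found on.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Only unambiguous bases plus N are complemented; other IUPAC codes pass through.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_fragment(lines: Iterable[str], fragment: str) -> List[Tuple[str, str]]:
    """List the reference records that contain the fragment, with the strand."""
    forward = fragment.upper()
    reverse = reverse_complement(forward)
    hits = []
    for header, seq in read_fasta(lines):
        if forward in seq:
            hits.append((header, "forward"))
        elif reverse in seq:
            hits.append((header, "reverse-complement"))
    return hits


# Hypothetical usage (file name and fragment are placeholders):
# with open("reference.fasta") as handle:
#     for header, strand in find_fragment(handle, "ACGTACGT"):
#         print(strand, header, sep="\t")
```

An exact substring match like this will miss references that differ from the query by even a single base, so an empty result does not prove that the taxon is absent from the database.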
Below are the commands I ran:
module load QIIME2/2019.7

qiime feature-classifier classify-sklearn \
  --i-classifier crux_classifier.qza \
  --i-reads combined-seqtab-rep-seqs.qza \
  --o-classification combined-taxonomy-crux-v2.qza \
  --verbose

qiime metadata tabulate \
  --m-input-file combined-taxonomy-crux-v2.qza \
  --o-visualization combined-taxonomy-crux-v2.qzv
The majority of my sequences are the expected length (313 bp):
| Sequence Count | Min Length | Max Length | Mean Length | Range | Standard Deviation |
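For anyone who wants to recompute these summary columns from a FASTA export of the representative sequences, here is a rough Python sketch. The function names are hypothetical, and using the sample standard deviation is an assumption; the tool that produced the original table may use a different definition.

```python
import statistics
from typing import Dict, Iterable, List


def fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of every record in the lines of a FASTA file."""
    lengths = []
    current = None  # running length of the record being read
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def length_summary(lengths: List[int]) -> Dict[str, float]:
    """Summarise sequence lengths using the same column names as the table."""
    if not lengths:
        raise ValueError("no sequences found")
    return {
        "Sequence Count": len(lengths),
        "Min Length": min(lengths),
        "Max Length": max(lengths),
        "Mean Length": statistics.mean(lengths),
        "Range": max(lengths) - min(lengths),
        # Assumption: sample standard deviation; defined as 0.0 for one sequence.
        "Standard Deviation": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
    }


# Hypothetical usage (file name is a placeholder):
# with open("rep-seqs.fasta") as handle:
#     print(length_summary(fasta_lengths(handle)))
```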
I see similar behavior when I run the RDP classifier through DADA2, by the way. See the DADA2 GitHub issue "AssignTaxonomy() using custom COI database yields Arthropods or NA's" (Issue #1318, benjjneb/dada2).
Any suggestions as to why this is happening? I am scratching my head over here, and I don't think this has come up in a previous forum question.