Hello all,
I am running a custom COI database from BOLD; extracting the reads and training the classifier took a few days on my HPC, following the instructions in the QIIME 2 docs. When I run feature-classifier classify-sklearn on my data (marine water samples), almost all of the sequences come back as unidentified arthropods or, oddly, birds. I am a bit stuck here. I tried this custom database to compare it with Midori: the same steps with Midori gave reasonable results, though they left something to be desired, which is why I was trying this other database.
For example, the first sequence below unambiguously matches Cephalopholis cyanostigma on BLAST, yet it is assigned as an insect. When I copy a portion of this sequence and search for it in the database FASTA file, I am able to locate Cephalopholis cyanostigma sequences, or at least reference sequences for the genus.
| Feature ID | Taxon | Confidence |
|---|---|---|
| #q2:types | categorical | categorical |
| TAGCCGGCAACCTGGCTCATGCAGGCGCTTCCGTTGATTTAACAATCTTTTCACTACATTTAGCAGGTATTTCATCAATTCTAGGGGCAATCAACTTTATCACAACCATTATTAACATGAAACCTCCCGCCATCTCCCAATACCAAACACCCCTGTTTGTATGGGCTGTATTAATTACAGCTGTCCTTCTTCTTCTTTCCCTCCCCGTTCTCGCTGCAGGTATTACAATGCTTCTAACTGATCGAAACCTGAACACCACCTTCTTTGACCCAGCTGGTGGAGGAGACCCAATTCTTTATCAACACTTATTT | Eukaryota;Arthropoda;Insecta | 0.896227957 |
| TCTTAGTCACATTACAAGTCACTCAGGAGGGGCTGTAGACTTAGCAATTTTTAGCTTACACCTTTCAGGGGCTTCAAGCATTCTTGGAGCAATTAATTTTATTACCACAATTTTTAATATGCGTGGCCCTGGTTTAAGTATGCACAGACTCCCACTTTTTGTTTGGTCTGTTTTAATTACAGCTTTTTTATTACTTTTATCTCTTCCTGTTCTTGCAGGAGCTATTACAATGCTTTTAACGGACAGAAATTTTAATACTTCTTTTTTTGATCCAGCTGGAGGAGGTGATCCGATTTTATTTCAGCACCTTTTT | Eukaryota;Arthropoda;Insecta | 0.999923614 |
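The manual check described above (searching the reference FASTA for a fragment of the query) can be sketched in Python. This is only an illustrative sketch: the function names are hypothetical, and the reverse-complement search is an extra check that the post does not mention, added on the assumption that reference orientation is worth ruling out.

```python
from typing import Iterable, Iterator, List, Tuple

# Translation table used to complement unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    record_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            header = line[1:].split()
            record_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_fragment(fragment: str, fasta_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """List every reference record that contains the fragment, with its orientation."""
    forward = fragment.upper()
    reverse = reverse_complement(forward)
    hits = []
    for record_id, seq in parse_fasta(fasta_lines):
        seq = seq.upper()
        if forward in seq:
            hits.append((record_id, "forward"))
        elif reverse in seq:
            hits.append((record_id, "reverse-complement"))
    return hits
```

In practice `fasta_lines` would be an open file handle for the reference FASTA. Exact substring matching only finds identical stretches, so a fragment that differs by one base from every reference returns no hits.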
Below are the commands I ran:
module load QIIME2/2019.7

qiime feature-classifier classify-sklearn \
  --i-classifier crux_classifier.qza \
  --i-reads combined-seqtab-rep-seqs.qza \
  --o-classification combined-taxonomy-crux-v2.qza \
  --verbose

qiime metadata tabulate \
  --m-input-file combined-taxonomy-crux-v2.qza \
  --o-visualization combined-taxonomy-crux-v2.qzv
The majority of my sequences are the expected length (313 bp):
| Sequence Count | Min Length | Max Length | Mean Length | Range | Standard Deviation |
|---|---|---|---|---|---|
| 41017 | 190 | 318 | 309.2 | 128 | 17.66 |
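For anyone wanting to reproduce this kind of length summary outside QIIME 2, here is a minimal Python sketch. The function names are hypothetical, and it assumes the sample standard deviation; the tool that produced the table above may use a different estimator.

```python
import statistics
from typing import Dict, Iterable, List


def length_summary(sequences: Iterable[str]) -> Dict[str, float]:
    """Summarize sequence lengths: count, min, max, mean, range, standard deviation."""
    lengths: List[int] = [len(seq) for seq in sequences]
    if not lengths:
        raise ValueError("no sequences supplied")
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "range": max(lengths) - min(lengths),
        # Sample standard deviation (an assumption); undefined for one value.
        "stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
    }


def fraction_with_length(sequences: Iterable[str], expected: int = 313) -> float:
    """Fraction of sequences whose length equals the expected amplicon length."""
    lengths = [len(seq) for seq in sequences]
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n == expected) / len(lengths)
```

The second function makes "the majority are the expected length" a number that can be checked directly, which is more informative than the mean when a few short reads pull it down.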
I observed similar behavior when I ran the RDP classifier through DADA2, by the way: see "AssignTaxonomy() using custom COI database yields Arthropods or NA's", Issue #1318 in benjjneb/dada2 on GitHub.
Any suggestions for why this is happening? I'm scratching my head over here, and I don't think this has come up in a previous forum question.
Hi @elaine-shen ,
There could be a variety of reasons for this. The most common is inconsistent taxonomic annotations, which 'confuse' the classifier.
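One way to look for inconsistent annotations is to group identical reference sequences and flag groups whose taxonomy disagrees. The Python sketch below is an illustration under stated assumptions, not part of RESCRIPt or QIIME 2: the function name is hypothetical, and it assumes the sequences and taxonomy have already been loaded into dictionaries keyed by reference ID.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def conflicting_annotations(
    sequences: Dict[str, str],
    taxonomy: Dict[str, str],
    depth: int = 3,
) -> Dict[Tuple[str, ...], List[str]]:
    """Find identical sequences whose taxonomy differs within the first `depth` ranks.

    Returns a mapping from the tuple of reference IDs sharing a sequence
    to the sorted list of distinct (truncated) taxonomy strings they carry.
    """
    taxa_by_seq = defaultdict(set)
    ids_by_seq = defaultdict(list)
    for record_id, seq in sequences.items():
        taxon = taxonomy.get(record_id)
        if taxon is None:
            continue  # no annotation to compare
        ranks = [rank.strip() for rank in taxon.split(";")]
        key = seq.upper()
        taxa_by_seq[key].add(";".join(ranks[:depth]))
        ids_by_seq[key].append(record_id)
    return {
        tuple(ids_by_seq[key]): sorted(taxa)
        for key, taxa in taxa_by_seq.items()
        if len(taxa) > 1
    }
```

With `depth=3` and the semicolon-delimited strings shown in this thread, the comparison stops at the third rank, so two identical sequences labeled as an insect and as a fish would be reported while species-level disagreements would not.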
I would suggest trying out the CO1 reference databases from @devonorourke :
BOLD references:
Citations:
If you use the following COI resources or RESCRIPt for COI database preparation, please cite the following:
Michael S Robeson II, Devon R O’Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. RESCRIPt: Reproducible sequence taxonomy reference database management for the masses. bioRxiv 2020.10.05.326504;…
The files are available here:
Hi @smayne11 , thanks for your question.
I think you should give the ANML classifier a shot. I'd be curious to see how many of your sequence variants are classified (or unclassified), and among those with some taxonomic labels, what fraction are being assigned Family, Genus, or Species-level information.
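The fractions mentioned above can be tallied from the Taxon column of the classification output. This is a rough Python sketch with hypothetical function names; it assumes semicolon-delimited taxonomy strings with seven levels in the order used by the reference file in this thread, and that unclassified features are labeled "Unassigned".

```python
from typing import Iterable, List

# Assumed rank order, matching strings such as
# "Eukaryota;Arthropoda;Branchiopoda;Anostraca;Branchinectidae;Branchinecta;Branchinecta lindahli".
RANKS: List[str] = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def deepest_rank(taxon: str) -> str:
    """Name the deepest rank present in a semicolon-delimited taxonomy string."""
    levels = [level.strip() for level in taxon.split(";") if level.strip()]
    if not levels or levels == ["Unassigned"]:
        return "unassigned"
    return RANKS[min(len(levels), len(RANKS)) - 1]


def fraction_resolved_to(taxa: Iterable[str], rank: str) -> float:
    """Fraction of taxonomy strings annotated at `rank` or deeper."""
    taxa = list(taxa)
    if not taxa:
        return 0.0
    needed = RANKS.index(rank)
    resolved = 0
    for taxon in taxa:
        deepest = deepest_rank(taxon)
        if deepest != "unassigned" and RANKS.index(deepest) >= needed:
            resolved += 1
    return resolved / len(taxa)
```

Calling `fraction_resolved_to(taxa, "family")`, then `"genus"`, then `"species"` gives the three figures asked about. A reference database that uses rank prefixes or a different number of levels would need `RANKS` adjusted.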
If the classifier isn't working for you and you want to generate your primer-specific classifier starting from the broader BOLD sequence and taxonomy files now hosted by QIIME2 here:
https://…
and the NCBI approach here:
There is another CO1 reference database maintained by the Porter Lab available here.
I think it would be good to compare these reference databases to your existing one.
-Cheers!
-Mike
I'll give these databases a shot, thanks! @devonorourke and I have certainly crossed paths on the forum. Thanks for both of y'all's hard work!
For completeness, here are the first few lines of the fasta and taxonomy files, in case there are formatting issues I did not catch (though I suspect this isn't the problem, as I was able to run qiime feature-classifier fit-classifier-naive-bayes with no issues):
>LACM:DISCO:7833
TTTGTCTAGAAACCTAGCTCATATAGGTGGGTCTGTAGATTTAGCTATTTTTTCTCTTCATTTAGCAGGGGCTTCGTCAATTTTAGGTGCGGTAAATTTTATTACTACCGTAACTAACATGCGATGGGCAGGGATGCAATGAGAGCGCCTTACTTTATTTACTTGGTCTGTAAAAATTACTGCTGTTTTGCTTCTTTTGTCTCTTCCAGTTTTAGCCGGTGCAATTACAATATTACTAACGGACCGTAATTTTAATACTGCCTTTTTTGACCCTGCGGGAGGGGGGGACCCCGTACTATACCAGCATCTGTTT
>LACM:DISCO:7831
CCTATCATCAGGTATTGCTCACGGGGGGGCTTCAGTAGATTTAGCTATTTTTAGATTACATTTAGCGGGAATCTCATCAATTTTAGGGGCTGTGAATTTCATTACTACAATTATTAATATACGATCTGTTGGAATAACTTTTGATCGAATACCATTATTTGTGTGATCAGTAGGAATTACAGCACTATTATTACTTTTATCTYTACCTGTATTAGCGGGAGCTATTACAATATTATTAACTGATCGAAATTTAAATACTTCATTTTTTGATCCGGCGGGAGGGGGAGACCCTATTCTCTATCAACATTTATTT

LACM:DISCO:5659	Eukaryota;Arthropoda;Branchiopoda;Anostraca;Branchinectidae;Branchinecta;Branchinecta lindahli
LACM:DISCO:5661	Eukaryota;Arthropoda;Branchiopoda;Anostraca;Branchinectidae;Branchinecta;Branchinecta lindahli
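One formatting check that is easy to automate is whether every ID in the FASTA file has a taxonomy entry and vice versa. The Python sketch below uses hypothetical function names and assumes a tab-separated taxonomy file with the ID in the first column and an optional "Feature ID" header row.

```python
from typing import Dict, Iterable, List, Set


def fasta_ids(lines: Iterable[str]) -> Set[str]:
    """Collect record IDs (header text up to the first whitespace) from FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                ids.add(header[0])
    return ids


def taxonomy_ids(lines: Iterable[str]) -> Set[str]:
    """Collect IDs from the first column of tab-separated taxonomy lines."""
    ids = set()
    for line in lines:
        if not line.strip():
            continue
        first_field = line.rstrip("\n").split("\t", 1)[0].strip()
        if first_field == "Feature ID":
            continue  # skip an optional header row
        ids.add(first_field)
    return ids


def compare_ids(fasta_lines: Iterable[str], taxonomy_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Report IDs present in one file but absent from the other."""
    seq_ids = fasta_ids(fasta_lines)
    tax_ids = taxonomy_ids(taxonomy_lines)
    return {
        "missing_taxonomy": sorted(seq_ids - tax_ids),
        "missing_sequence": sorted(tax_ids - seq_ids),
    }
```

Both arguments can be open file handles. Two empty lists mean the ID sets agree; that rules out one class of formatting problem but says nothing about whether the annotations themselves are correct.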
Hi @elaine-shen ,
Yeah, I do not think it is a formatting issue. There are often 'baked-in' misannotations or improper curation in the respective databases where this information is downloaded. For example, there are many bad references that contain ambiguous bases, are too short, etc. Some of this is referenced here:
Hi @clairewill22 ,
Sorry to hear you're running into issues!
How are you training your own classifier? Can you share your command? Make sure you use the min and max length parameters with appropriate thresholds... a common issue with training SILVA classifiers specifically is that junk sequences left in SILVA (e.g., with lots of ambiguous bases) can cause hits to disparate kingdoms, causing the classifier to get confused (search "hot spring metagenome" in the forum archive for some examples!).
…
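The kind of cleanup described in the quoted reply (dropping references that are too short, too long, or full of ambiguous bases) can be sketched in plain Python. This is a stand-in for illustration only, not the QIIME 2 or RESCRIPt commands; the function names are hypothetical and the thresholds are left to the caller because suitable values depend on the amplicon.

```python
from typing import Dict, List, Tuple


def count_ambiguous(seq: str) -> int:
    """Count characters other than A, C, G, T (for example N, Y, R, or gaps)."""
    return sum(1 for base in seq.upper() if base not in "ACGT")


def filter_references(
    records: Dict[str, str],
    min_len: int,
    max_len: int,
    max_ambiguous: int,
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Split reference records into kept sequences and dropped IDs with reasons."""
    kept: Dict[str, str] = {}
    dropped: Dict[str, List[str]] = {}
    for record_id, seq in records.items():
        reasons = []
        if len(seq) < min_len:
            reasons.append("too short")
        if len(seq) > max_len:
            reasons.append("too long")
        if count_ambiguous(seq) > max_ambiguous:
            reasons.append("too many ambiguous bases")
        if reasons:
            dropped[record_id] = reasons
        else:
            kept[record_id] = seq
    return kept, dropped
```

Keeping the reasons for each dropped record makes it easy to see whether a filter is removing a handful of junk sequences or a large share of the database. Any record removed here would also need its row removed from the taxonomy file so the two stay in sync.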
Closed by system on May 20, 2021, 7:45pm. This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.