Hi, I am trying to build an insect CO1 classifier, but it is taking a very long time! I am currently using the bold_derep1_seqs.qza file I got from the "Building a COI database from BOLD references" tutorial. I then ran this code to make the classifier specific for my primers:
However, as mentioned, it is taking a long time. I understand the database is large, but I am getting worried that it is not doing anything or that I have done something wrong. I have been running it for over 24 hours now. Is there an easier way to get or make a CO1 classifier? Thanks!
It can certainly take a long while to build a classifier, sometimes a few days. Do you know how many sequences you have? Have you performed any QA/QC like dereplication?
Thank you so much for your response! The sequences had these things done to them before I downloaded them:
"The raw BOLD sequences were initially filtered for ambiguous nucleotide content (5 or more N's), long homopolymer runs (12 or more), very short (< 250 bp) or very long (> 1600 bp) sequences, and dereplicated."
Therefore, I was going to make it specific for my primers, then dereplicate again and build the classifier.
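For anyone who wants to see what the quoted filtering amounts to, here is a rough Python sketch of those criteria. It is only an illustration, not the code the tutorial actually used: the function name `passes_bold_filters` is made up, and I am assuming that "5 or more N's" means the total count of N in the sequence and that homopolymer runs are counted over A, C, G and T only.

```python
import re

# Thresholds taken from the quoted description (assumed interpretation).
MAX_N = 4              # 5 or more N's is rejected
MAX_HOMOPOLYMER = 11   # runs of 12 or more identical bases are rejected
MIN_LEN = 250          # shorter than 250 bp is rejected
MAX_LEN = 1600         # longer than 1600 bp is rejected


def passes_bold_filters(seq):
    """Return True if a sequence would survive the quoted filters (hypothetical helper)."""
    seq = seq.upper()
    if not (MIN_LEN <= len(seq) <= MAX_LEN):
        return False
    if seq.count("N") > MAX_N:
        return False
    # Length of the longest run of a single repeated base.
    longest_run = max(
        (len(m.group(0)) for m in re.finditer(r"([ACGT])\1*", seq)),
        default=0,
    )
    return longest_run <= MAX_HOMOPOLYMER
```

Since the downloaded file has already been through this kind of filtering, there should be no need to repeat it before trimming to the primers.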
I am not sure how many sequences there are. How would I be able to tell?
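One way to check the number of sequences is to count the FASTA records inside the artifact. A .qza file is a zip archive, and for sequence artifacts the data normally sits in a file called dna-sequences.fasta under the data/ directory. The sketch below relies on that layout; the function name `count_sequences_in_qza` is just an illustrative choice.

```python
import zipfile


def count_sequences_in_qza(qza_path):
    """Count FASTA records in a QIIME 2 sequence artifact (a zip archive)."""
    with zipfile.ZipFile(qza_path) as archive:
        # Assumed layout: <uuid>/data/dna-sequences.fasta
        fasta_names = [
            name for name in archive.namelist()
            if name.endswith("/data/dna-sequences.fasta")
        ]
        if not fasta_names:
            raise ValueError("no dna-sequences.fasta found in " + str(qza_path))
        with archive.open(fasta_names[0]) as handle:
            # Every FASTA record starts with a '>' header line.
            return sum(1 for line in handle if line.startswith(b">"))
```

For example, `count_sequences_in_qza("bold_derep1_seqs.qza")` would return the number of reference sequences in the file from the tutorial.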
Are these dereplicated? If not, try that. Otherwise, when dereplicating, try setting --p-perc-identity 0.99 (the value is a proportion, so 0.99 means 99%) to perform some minor clustering. You might even have to try 0.98, because even if you can construct the classifier, it might take a lot of RAM to use it.
Is there another, easier, way to make an insect CO1 classifier? I tried to use a pre-made one but unfortunately it didn't work with the version of qiime2 I am using...
If you have access to the sequence and taxonomy files, you should be able to build the classifier yourself for your version of QIIME 2. Off the top of my head, I am unaware of any existing pre-compiled files.
Also, it is quite okay to cluster the sequences if needed. There is a measure of practicality to constructing databases. You can also try classify-consensus-vsearch and classify-consensus-blast if classify-sklearn becomes untenable.
OK, thank you so much! I have dereplicated the sequences at 98% identity and managed to reduce the number of sequences a lot (now 448211!). Hopefully this will make trimming to my primers quicker. Thanks again for your help!