Making insect classifier

graceb · May 17, 2025, 3:12pm

Hi, I am trying to make an insect CO1 classifier, but it is taking a very long time! I am currently using the bold_derep1_seqs.qza file I got from the "Building a COI database from BOLD references" tutorial. I then ran this code to make the classifier specific to my primers:

qiime feature-classifier extract-reads \
    --i-sequences bold_derep1_seqs.qza \
    --p-f-primer GGWACWGGWTGAACWGTWTAYCCYCC \
    --p-r-primer TANACYTCNGGRTGNCCRAARAAYCA \
    --p-n-jobs 2 \
    --p-read-orientation 'forward' \
    --o-reads sequencesampliconspecific.qza

However, as mentioned, it is taking a while to do so. I understand the database is large but I am getting worried that it is not doing anything / I have done something wrong. I have been running it for over 24 hours now. Is there an easier way to get / make a CO1 classifier? Thanks!

SoilRotifer · May 17, 2025, 4:23pm

Hi @graceb,

It can certainly take a long while to build a classifier, sometimes a few days. Do you know how many sequences you have? Have you performed any QA/QC like dereplication?

graceb · May 17, 2025, 5:17pm

Hi,

Thank you so much for your response! The sequences had these things done to them before I downloaded them:

"The raw BOLD sequences were initially filtered for ambiguous nucleotide content (5 or more N's), long homopolymer runs (12 or more), very short (< 250 bp) or very long (> 1600 bp) sequences, and dereplicated."

Therefore, I was going to make it specific for my primers, then dereplicate again and build the classifier.

I am not sure how many sequences there are, how would I be able to tell?

Again thanks for your help!

SoilRotifer · May 17, 2025, 5:43pm

Hi @graceb,

Great!

That is what I often do myself.

You can simply run the following to tabulate your sequences and see how many there are:

qiime feature-table tabulate-seqs \
    --i-data sequencesampliconspecific.qza \
    --o-visualization sequencesampliconspecific.qzv

graceb · May 17, 2025, 6:27pm

Amazing, thanks. For the last bit, I ran the code and tried to visualise the output on QIIME 2 View, but nothing shows on the screen?

SoilRotifer · May 17, 2025, 6:33pm

Depending on how many sequences are in the file... it can take a while to load as a qzv.

Perhaps try this instead:

# export FASTA
qiime tools export \
    --input-path sequencesampliconspecific.qza \
    --output-path sequencesampliconspecific_export

# count number of sequences
grep -c '^>'  sequencesampliconspecific_export/dna-sequences.fasta
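
If it helps, the same exported FASTA can also give a rough picture of sequence lengths, which is handy for judging how much the primer trimming removed. This is only a minimal sketch: example.fasta is a made-up toy file so the snippet runs on its own, and you would point the fasta variable at sequencesampliconspecific_export/dna-sequences.fasta instead.

```shell
# Hypothetical toy FASTA standing in for the exported dna-sequences.fasta
fasta=example.fasta
cat > "$fasta" <<'EOF'
>seq1
ACGTACGT
ACGT
>seq2
ACGTAC
>seq3
ACGTACGTACGTACGT
EOF

# Count records (header lines start with '>')
n_seqs=$(grep -c '^>' "$fasta")

# Sum residues per record (handles wrapped sequences), then report min/max/mean
length_summary=$(awk '
    function record(l) {
        n++; total += l
        if (min == "" || l < min) min = l
        if (l > max) max = l
    }
    /^>/ { if (len) record(len); len = 0; next }
    { len += length($0) }
    END {
        if (len) record(len)
        printf "min=%d max=%d mean=%.1f", min, max, total / n
    }
' "$fasta")

echo "sequences: $n_seqs"
echo "lengths:   $length_summary"
```

On a file with over a million records this should still finish quickly, since both grep and awk stream through the file once.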

graceb · May 17, 2025, 7:11pm

That worked great! There are 1718762 sequences - which definitely sounds like a lot!

SoilRotifer · May 17, 2025, 7:42pm

Holy smokes!

Are these dereplicated? If not, try that. Otherwise, when dereplicating, try setting --p-perc-identity 0.99 (the value is a proportion, so 0.99 means 99%) to perform some minor clustering. You might even have to try 98% (0.98), because even if you can construct the classifier, it might take a lot of RAM to use it...
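
A rough sketch of what that might look like with RESCRIPt's dereplicate action. The taxonomy file name (bold_derep1_taxa.qza) and both output names are assumptions on my part, so adjust them to whatever you actually have. The guard around the command is only there so nothing runs if QIIME 2 or the inputs are missing.

```shell
# Assumed file names; adjust to match your own artifacts
SEQS=bold_derep1_seqs.qza
TAXA=bold_derep1_taxa.qza   # assumption: taxonomy artifact paired with the sequences
PERC_IDENTITY=0.99          # a proportion, so 99% is written as 0.99

# Only run when QIIME 2 and both inputs are available
if command -v qiime >/dev/null 2>&1 && [ -f "$SEQS" ] && [ -f "$TAXA" ]; then
    qiime rescript dereplicate \
        --i-sequences "$SEQS" \
        --i-taxa "$TAXA" \
        --p-perc-identity "$PERC_IDENTITY" \
        --o-dereplicated-sequences bold_derep99_seqs.qza \
        --o-dereplicated-taxa bold_derep99_taxa.qza
else
    echo "QIIME 2 or the input artifacts were not found; nothing was run."
fi
```

Lowering PERC_IDENTITY to 0.98 clusters more aggressively and should shrink the reference further, at the cost of merging some closely related sequences.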

graceb · May 17, 2025, 7:49pm

Ok thanks I will give that a go!

Is there another, easier, way to make an insect CO1 classifier? I tried to use a pre-made one but unfortunately it didn't work with the version of qiime2 I am using...

SoilRotifer · May 17, 2025, 7:53pm

If you have access to the sequence and taxonomy files, you should be able to build the classifier yourself for your version of QIIME 2. Off the top of my head, I am unaware of any existing pre-compiled files.

Also, it is quite okay to cluster the sequences if needed. There is a measure of practicality to constructing databases. You can also try classify-consensus-vsearch or classify-consensus-blast if classify-sklearn becomes untenable.
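
For the vsearch route, a minimal sketch might look like the following. All file names are placeholders: rep-seqs.qza stands in for your own query sequences, and the two reference artifacts for whatever dereplicated sequence and taxonomy files you end up with. The identity and thread values are only illustrative starting points, not recommendations.

```shell
# Placeholder file names; replace with your own artifacts
QUERY=rep-seqs.qza                 # hypothetical: your representative sequences
REF_SEQS=bold_derep99_seqs.qza     # hypothetical: dereplicated reference reads
REF_TAXA=bold_derep99_taxa.qza     # hypothetical: matching reference taxonomy
OUT_DIR=vsearch_classification     # QIIME 2 expects this directory not to exist yet

# Only run when QIIME 2 and all inputs are available
if command -v qiime >/dev/null 2>&1 \
    && [ -f "$QUERY" ] && [ -f "$REF_SEQS" ] && [ -f "$REF_TAXA" ]; then
    qiime feature-classifier classify-consensus-vsearch \
        --i-query "$QUERY" \
        --i-reference-reads "$REF_SEQS" \
        --i-reference-taxonomy "$REF_TAXA" \
        --p-perc-identity 0.97 \
        --p-threads 2 \
        --output-dir "$OUT_DIR"
else
    echo "QIIME 2 or the input artifacts were not found; nothing was run."
fi
```

Using --output-dir rather than naming each output keeps the command the same across QIIME 2 releases that differ in how many artifacts this action writes. No trained classifier is needed here, so the long fitting step is skipped entirely.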

graceb · May 18, 2025, 8:00pm

Ok, thank you so much! I have dereplicated the sequences at 98% identity and managed to reduce the number of sequences a lot (now 448211!). Hopefully this will make trimming to my primers quicker. Thanks again for your help!