Making insect classifier

Hi, I am trying to make an insect CO1 classifier, however it is taking a very long time! I am currently using the bold_derep1_seqs.qza I got from the "Building a COI database from BOLD references" tutorial. I then ran this code to make the classifier specific to my primers:

qiime feature-classifier extract-reads \
    --i-sequences bold_derep1_seqs.qza \
    --p-f-primer GGWACWGGWTGAACWGTWTAYCCYCC \
    --p-r-primer TANACYTCNGGRTGNCCRAARAAYCA \
    --p-n-jobs 2 \
    --p-read-orientation 'forward' \
    --o-reads sequencesampliconspecific.qza

However, as mentioned, it is taking a while to do so. I understand the database is large but I am getting worried that it is not doing anything / I have done something wrong. I have been running it for over 24 hours now. Is there an easier way to get / make a CO1 classifier? Thanks!

Hi @graceb,

It can certainly take a long while to build a classifier, sometimes a few days. Do you know how many sequences you have? Have you performed any QA/QC like dereplication?

Hi,

Thank you so much for your response! The sequences had these things done to them before I downloaded them:

"The raw BOLD sequences were initially filtered for ambiguous nucleotide content (5 or more N's), long homopolymer runs (12 or more), very short (< 250 bp) or very long (> 1600 bp) sequences, and dereplicated."

Therefore, I was going to make it specific for my primers, then dereplicate again and build the classifier.

I am not sure how many sequences there are, how would I be able to tell?

Again thanks for your help!

Hi @graceb,

Great!

That is what I often do myself. :+1:

You can simply run the following to tabulate your sequences and see how many there are:

qiime feature-table tabulate-seqs \
    --i-data sequencesampliconspecific.qza \
    --o-visualization sequencesampliconspecific.qzv

Amazing, thanks. I ran that code and tried to visualise the output on QIIME 2 View, however there is nothing on the screen?

Depending on how many sequences are in the file... it can take a while to load as a qzv.

Perhaps try this instead:

# export FASTA
qiime tools export \
    --input-path sequencesampliconspecific.qza \
    --output-path sequencesampliconspecific_export

# count number of sequences
grep -c '^>'  sequencesampliconspecific_export/dna-sequences.fasta
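
If you also want a quick sanity check on the amplicon lengths after primer extraction, something like the sketch below may help. The function names count_seqs and seq_length_summary are made up for illustration; they only rely on grep and awk and assume an uncompressed FASTA file like the one exported above.

```shell
#!/usr/bin/env bash
# count_seqs: print the number of FASTA records (header lines starting with '>').
count_seqs() {
    grep -c '^>' "$1"
}

# seq_length_summary: print the record count plus the shortest and longest
# sequence length. Sequences wrapped over several lines are summed per record.
seq_length_summary() {
    awk '
        /^>/ { if (seen) record(); seen = 1; len = 0; next }
        { len += length($0) }
        END { if (seen) record(); printf "n=%d min=%d max=%d\n", n, min, max }
        function record() {
            n++
            if (n == 1 || len < min) min = len
            if (len > max) max = len
        }
    ' "$1"
}
```

For example, `seq_length_summary sequencesampliconspecific_export/dna-sequences.fasta` prints a single line with the count and the length range, which should roughly match the amplicon size you expect from your primers.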

That worked great! There are 1718762 sequences - which definitely sounds like a lot!

Holy smokes! :scream:

Are these dereplicated? If not, try that. Otherwise, when dereplicating, try setting --p-perc-identity 0.99 (i.e. 99%) to perform some minor clustering. You might even have to try 98%... as even if you can construct the classifier, it might take a lot of RAM to use it...
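
A rough sketch of what that could look like, assuming the dereplication is done with the RESCRIPt plugin (qiime rescript dereplicate), where the identity is given as a proportion, so 99% is written 0.99. The taxonomy file name bold_derep1_taxa.qza and the output names are assumptions, and the run / dereplicate_refs helpers are just illustrative wrappers: by default they only print the command, and setting RUN=1 executes it.

```shell
#!/usr/bin/env bash
# Dry-run helper: print the command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# dereplicate_refs SEQS TAXA [IDENTITY]
# IDENTITY is a proportion (0.99 = 99%); output names are illustrative only.
dereplicate_refs() {
    seqs="$1"; taxa="$2"; identity="${3:-0.99}"
    run qiime rescript dereplicate \
        --i-sequences "$seqs" \
        --i-taxa "$taxa" \
        --p-mode uniq \
        --p-perc-identity "$identity" \
        --o-dereplicated-sequences "derep_${identity}_seqs.qza" \
        --o-dereplicated-taxa "derep_${identity}_taxa.qza"
}

# Assumed input names; replace with your own artifacts.
dereplicate_refs sequencesampliconspecific.qza bold_derep1_taxa.qza 0.99
```

Dropping the third argument to 0.98 clusters a little harder if 0.99 still leaves too many sequences.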

Ok thanks I will give that a go!

Is there another, easier, way to make an insect CO1 classifier? I tried to use a pre-made one but unfortunately it didn't work with the version of qiime2 I am using...

If you have access to the sequence and taxonomy files, you should be able to build the classifier yourself for your version of QIIME 2. Off the top of my head, I am unaware of any existing pre-compiled files.

Also, it is quite okay to cluster the sequences if needed. There is a measure of practicality to constructing databases. You can also try classify-consensus-vsearch or classify-consensus-blast if classify-sklearn becomes untenable.
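
As a rough sketch, the two routes look something like this. All file and directory names are placeholders, and the run, train_sklearn and classify_vsearch helpers are only illustrative wrappers around the q2-feature-classifier commands: they print the command by default and execute it when RUN=1 is set.

```shell
#!/usr/bin/env bash
# Dry-run helper: print the command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Route 1: train a naive Bayes classifier for later use with classify-sklearn.
# Args: reference reads, reference taxonomy, output classifier.
train_sklearn() {
    run qiime feature-classifier fit-classifier-naive-bayes \
        --i-reference-reads "$1" \
        --i-reference-taxonomy "$2" \
        --o-classifier "$3"
}

# Route 2: no training step; assign taxonomy by VSEARCH alignment consensus.
# Args: query sequences, reference reads, reference taxonomy, output directory.
classify_vsearch() {
    run qiime feature-classifier classify-consensus-vsearch \
        --i-query "$1" \
        --i-reference-reads "$2" \
        --i-reference-taxonomy "$3" \
        --p-threads 2 \
        --output-dir "$4"
}

# Placeholder names; swap in your own artifacts.
train_sklearn ref_seqs.qza ref_taxa.qza coi_classifier.qza
classify_vsearch rep_seqs.qza ref_seqs.qza ref_taxa.qza vsearch_classification
```

The consensus route avoids the training step entirely, at the cost of searching the reference reads every time you classify. As far as I know, classify-consensus-blast takes the same query, reference reads and reference taxonomy inputs.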

Ok, thank you so much! I have dereplicated the sequences to an identity of 98% and managed to reduce the number of sequences a lot (now 448211!). Hopefully this will be quicker when trimming according to my primers. Thanks again for your help!
