Creating a classifier for lepidopterans

I am working on a project to identify leaf miners, all of which are lepidopterans. I built a classifier for the order Lepidoptera following this tutorial (Building a COI database from NCBI references), but I didn't do any filtering because I wasn't sure what criteria to apply. I went ahead and used it, and about 70% of my assignments went down to the species level. I am trying to improve on that, so I tried to use an existing classifier for arthropods, "bold_full_ArthOnly_classifier.qza", to classify my sequences. QIIME 2 gave me an error saying that the scikit-learn version used to create the classifier was incompatible with mine. I then tried to rebuild the classifier in QIIME 2 using the following script:

qiime rescript evaluate-fit-classifier \
  --i-sequences NCBIdata_notBOLD_seq.qza \
  --i-taxonomy NCBIdata_notBOLD_tax.qza \
  --p-reads-per-batch 6000 \
  --p-n-jobs 6 \
  --output-dir NCBIdata_notBOLD_Arthopod
I got the sequence and taxonomy files from @devonorourke's CO1 database on OSF.
It returned the following error: "The taxonomy IDs must be a superset of the sequence IDs. The following feature IDs are missing from the sequences: MT251879.1, MT250343.1, DI201847.1, DI201848.1, DI201846.1, DI201845.1, M27461.1, DI201849.1"

I am assuming the sequence and taxonomy files do not correspond completely.

Any help fixing this issue would be appreciated. Since I am targeting moths specifically, would it make more sense to continue with the Lepidoptera classifier, filter it, and then use it again? If so, what would be sound filtering criteria: homopolymers, amplicon length, primers...?
Thank you in advance for your help.
Fernando

Hi @nietof ,

I think this thread will help:

You'll see suggestions on how to remove sequences that do not have a matching ID in the taxonomy file. Then you should be able to use that filtered file as input to train your classifier.
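
To illustrate the idea only, here is a minimal Python sketch of the ID matching, assuming the sequences and taxonomy have been exported to FASTA and tab-separated text. The toy records and the helper names (`parse_fasta`, `parse_taxonomy`, `keep_matched_seqs`) are made up for this example and are not part of QIIME 2 or RESCRIPt. Within QIIME 2 you would normally filter the .qza artifacts directly (for example with `qiime feature-table filter-seqs`, passing the taxonomy as metadata) rather than run a script like this.

```python
def parse_fasta(text):
    """Return {sequence_id: sequence}; the ID is the first word of each header."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def parse_taxonomy(text):
    """Return {feature_id: taxon} from a tab-separated table with one header row."""
    rows = {}
    for line in text.splitlines()[1:]:
        if line.strip():
            feature_id, _, taxon = line.partition("\t")
            rows[feature_id] = taxon
    return rows


def keep_matched_seqs(seqs, tax):
    """Keep only the sequences whose ID also appears in the taxonomy."""
    return {seq_id: seq for seq_id, seq in seqs.items() if seq_id in tax}


# Toy data (hypothetical IDs): seqC has no taxonomy entry.
fasta_text = ">seqA\nACGT\n>seqB\nGGCC\n>seqC\nTTAA\n"
tax_text = (
    "Feature ID\tTaxon\n"
    "seqA\tLepidoptera;Gracillariidae\n"
    "seqB\tLepidoptera;Nepticulidae\n"
    "seqD\tLepidoptera;Tischeriidae\n"
)

seqs = parse_fasta(fasta_text)
tax = parse_taxonomy(tax_text)

# IDs that would trigger the "superset" error: present in sequences, absent in taxonomy.
unmatched = sorted(seqs.keys() - tax.keys())
kept_seqs = keep_matched_seqs(seqs, tax)

print("Unmatched sequence IDs:", unmatched)
print("Sequences kept:", sorted(kept_seqs))
```

Extra taxonomy entries (like `seqD` here) are left alone, since the error only requires that every sequence ID has a taxonomy entry.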

-Mike
