How do I use a pretrained classifier?

Hi, I want to use a pretrained classifier, but I am completely new to this and don't really understand how to get the taxonomy file from a pretrained SILVA classifier. I have looked at the QIIME 2 and RESCRIPt tutorials, but I still have some questions, because what I have tried so far gets killed on my PC.

  1. There is no specific pretrained classifier for the 16S region I want. Do I download the full classifier and use that?

  2. The full classifier is huge and kills my system when I run the sklearn step. Do I need to extract region-specific reads and dereplicate first, or just make the chunks smaller when creating the taxonomy file (see code below)?

  3. Do I run the classify-sklearn step on the classifier and train it using my own dataset or a dummy dataset?

Do I run something like this to get the taxonomy file:
qiime feature-classifier classify-sklearn \
  --i-classifier pretrained-classifier.qza \
  --p-reads-per-batch 5000 \
  --i-reads mydata_rep_seqs.qza \
  --o-classification taxonomy_silva.qza

  4. Is there a less resource-hungry alternative to the classify-sklearn step to get the taxonomy file?


Good evening!

Welcome to the forums! :qiime2:

You are on the right track and asking all the right questions! First, check out the RESCRIPt tutorial, which is the most complete overview of this process.

Extracting region-specific reads and dereplicating are both good ideas, and RESCRIPt will help you do both :point_down:
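As a rough sketch of those two steps, assuming you already have SILVA reference sequences and taxonomy imported as QIIME 2 artifacts: the file names below are placeholders, and the primers are the 515F/806R V4 pair used purely as an example, so swap in the primers for your own region.

```shell
# Placeholder artifact names (hypothetical); adjust to your own files.
REF_SEQS=silva-seqs.qza
REF_TAX=silva-tax.qza
REGION_SEQS=silva-seqs-region.qza

# Example primers only (515F/806R, V4); replace with the primers for your region.
FWD_PRIMER=GTGYCAGCMGCCGCGGTAA
REV_PRIMER=GGACTACNVGGGTWTCTAAT

trim_and_dereplicate() {
  # 1. Cut the reference sequences down to the amplified region.
  qiime feature-classifier extract-reads \
    --i-sequences "$REF_SEQS" \
    --p-f-primer "$FWD_PRIMER" \
    --p-r-primer "$REV_PRIMER" \
    --o-reads "$REGION_SEQS" || return 1

  # 2. Collapse sequences that became identical after trimming.
  qiime rescript dereplicate \
    --i-sequences "$REGION_SEQS" \
    --i-taxa "$REF_TAX" \
    --p-mode 'uniq' \
    --o-dereplicated-sequences silva-seqs-region-uniq.qza \
    --o-dereplicated-taxa silva-tax-region-uniq.qza
}

# Only run when the QIIME 2 environment is active.
if command -v qiime >/dev/null 2>&1; then
  trim_and_dereplicate
fi
```

The trimmed, dereplicated reference is much smaller than full-length SILVA, which should lower memory use in whatever classification step comes next.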

Yes, you can avoid the pretraining with a top-hit LCA classifier like classify-consensus-vsearch. All good options!
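For illustration, a classify-consensus-vsearch call could look something like the sketch below. The reference file names are hypothetical, and the thread count is just an example value. It uses --output-dir rather than naming each output, on the assumption that the set of outputs may differ between QIIME 2 releases.

```shell
# Placeholder artifact names; swap in your own.
QUERY=mydata_rep_seqs.qza     # your representative sequences
REF_SEQS=silva-seqs.qza       # reference sequences (hypothetical name)
REF_TAX=silva-tax.qza         # matching reference taxonomy (hypothetical name)
OUT_DIR=vsearch-taxonomy      # QIIME 2 creates this; it must not exist yet

classify_with_vsearch() {
  # No trained classifier needed: the query is aligned against the reference
  # and a consensus taxonomy is assigned from the top hits.
  qiime feature-classifier classify-consensus-vsearch \
    --i-query "$QUERY" \
    --i-reference-reads "$REF_SEQS" \
    --i-reference-taxonomy "$REF_TAX" \
    --p-threads 4 \
    --output-dir "$OUT_DIR"
}

# Only run when the QIIME 2 environment is active.
if command -v qiime >/dev/null 2>&1; then
  classify_with_vsearch
fi
```

The taxonomy artifact ends up inside the output directory, and it can be used the same way as the output of classify-sklearn.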

I'm afraid I've given you much to consider without answering each of your questions. If you have more questions about LCA classifiers or how to use RESCRIPt, let me know.


Hi Colin,
thanks for answering so quickly. I've read the RESCRIPt and QIIME 2 tutorials, but I'm having some difficulty understanding exactly what I need to do. I browsed the forum posts, but I'm not sure I fully understand what they're saying. I'm not up to speed yet, but I'm working on it.

I do have some specific, basic questions about RESCRIPt:

  1. When you run the first step to get the SILVA data, does the data you download have both Bacteria and Archaea in it, or should I be looking for a specific version of the SILVA database?

  2. When you train the classifier, e.g. using classify-sklearn, do you use your own sequence data where it says '--i-reads'?

Thanks for pointing me in the direction of classify-consensus-vsearch, I will definitely look into that option as an alternative.


  1. Yes, both Bacteria and Archaea are included in SILVA.

  2. To train the classifier, you use your own database data. This could be data from a new microbe you sequenced and assembled, or it could be sequences from an existing database like SILVA.
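To make the two answers above concrete, here is a minimal sketch of the download, training, and classification steps. It assumes SILVA release 138.1 and uses placeholder file names. It also skips the cleaning and filtering steps covered in the RESCRIPt tutorial, so treat it as an outline rather than a full pipeline. Note where each kind of data goes: training (fit-classifier-naive-bayes) takes reference reads from the database, while classify-sklearn takes your own sequences through --i-reads.

```shell
# Placeholder names; YOUR_READS is your own data, everything else is reference data.
SILVA_VERSION=138.1
YOUR_READS=mydata_rep_seqs.qza

fetch_silva() {
  # A single download; Bacteria and Archaea are both included.
  qiime rescript get-silva-data \
    --p-version "$SILVA_VERSION" \
    --p-target 'SSURef_NR99' \
    --o-silva-sequences silva-rna-seqs.qza \
    --o-silva-taxonomy silva-tax.qza || return 1

  # SILVA sequences are distributed as RNA; convert them to DNA.
  qiime rescript reverse-transcribe \
    --i-rna-sequences silva-rna-seqs.qza \
    --o-dna-sequences silva-seqs.qza
}

train_classifier() {
  # Training only ever sees reference (database) sequences and taxonomy.
  qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads silva-seqs.qza \
    --i-reference-taxonomy silva-tax.qza \
    --o-classifier silva-classifier.qza
}

classify_reads() {
  # Classification is where your own sequences go in, via --i-reads.
  qiime feature-classifier classify-sklearn \
    --i-classifier silva-classifier.qza \
    --i-reads "$YOUR_READS" \
    --p-reads-per-batch 5000 \
    --o-classification taxonomy_silva.qza
}

# Only run when the QIIME 2 environment is active.
if command -v qiime >/dev/null 2>&1; then
  fetch_silva && train_classifier && classify_reads
fi
```

With a pretrained classifier, the first two functions are already done for you, and only the classify-sklearn step is run on your own data.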


Hi Colin
I ended up using classify-consensus-vsearch. It was a better option for me.

