Training a SILVA classifier: what is the best way?

Hi,

I am new to working with taxonomic data. I denoised my data and now want to train my classifier using the SILVA database and assign taxonomy to my data. I have separate files for my 16S and 18S data.

I was wondering whether I should first filter the SILVA database by domain (Bacteria and Eukaryota), or whether it is better to filter based on the primers I used.

I am currently trying to run the following script for my 16S data, but it does not work:

module load QIIME2/2024.10.1-foss-2023b-amplicon

qiime rescript filter-taxa \
  --i-taxonomy silva-138-99-tax.qza \
  --p-include Bacteria,Archaea \
  --p-exclude Eukaryota \
  --o-filtered-taxonomy silva-138-99-16S-tax.qza
  
qiime rescript filter-seqs-by-taxonomy \
  --i-sequences silva-138-99-seqs.qza \
  --i-taxonomy silva-138-99-16S-tax.qza \
  --o-filtered-sequences silva-138-99-16S-seqs.qza
  
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-138-99-16S-seqs.qza \
  --i-reference-taxonomy silva-138-99-16S-tax.qza \
  --o-classifier classifier-silva-138-99-16S.qza

qiime feature-classifier classify-sklearn \
  --i-classifier classifier-silva-138-99-16S.qza \
  --i-reads dada2_rep-seqs16S.qza \
  --o-classification taxonomy_16S.qza

What am I doing wrong?

Kind regards,

Silke

Hello Silke,

Welcome to the forums! :qiime2:

Thank you for posting all the commands you use to build this database.

Can you tell us more about how it fails? Any additional error messages and warnings would be very helpful for us.

Hi @Silke_Lambert,

I'd suggest not removing eukaryote / 18S sequences from the reference database. These act as good decoy / outgroup reference reads for the classifier.

The reason for this is that many of the standard 16S primers can, and often do, amplify 18S (and other off target) sequences. Thus, it is good to keep these in so that you can remove any reads that are identified as such. Otherwise you might erroneously classify reads as Bacteria / Archaea, when they are in fact Eukaryotes.
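
A minimal sketch of that clean-up step, run after classifying against the unfiltered reference: it drops every feature whose assigned taxonomy contains Eukaryota. The table name dada2_table16S.qza and both output names are assumptions (only taxonomy_16S.qza and dada2_rep-seqs16S.qza appear earlier in this thread), and `run` is a hypothetical helper that prints a command when DRY_RUN=1 instead of executing it.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Drop features classified as Eukaryota from the feature table and from the
# representative sequences. The table name and output names are placeholders.
remove_eukaryote_reads() {
  run qiime taxa filter-table \
    --i-table dada2_table16S.qza \
    --i-taxonomy taxonomy_16S.qza \
    --p-exclude Eukaryota \
    --o-filtered-table table16S-no-euk.qza

  run qiime taxa filter-seqs \
    --i-sequences dada2_rep-seqs16S.qza \
    --i-taxonomy taxonomy_16S.qza \
    --p-exclude Eukaryota \
    --o-filtered-sequences rep-seqs16S-no-euk.qza
}
```

Calling remove_eukaryote_reads with QIIME 2 loaded runs both filters; with DRY_RUN=1 set it only prints them. The --p-exclude value accepts a comma-separated list if other off-target groups need to go as well.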

As @colinbrislawn mentioned, please provide information about the types of errors you are running into. :slight_smile:

Error: QIIME 2 plugin 'rescript' has no action 'filter-seqs-by-taxonomy'.  Did you mean 'filter-seqs-length-by-taxon'?

"I'd suggest not removing eukaryote / 18S sequences from the reference database. These act as good decoy / outgroup reference reads for the classifier." So do I still need to filter, or can I just train a classifier as follows:

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-138-99-seqs.qza \
  --i-reference-taxonomy silva-138-99-tax.qza \
  --o-classifier classifier-silva-138-99.qza

Hi @Silke_Lambert,

The error message is telling you exactly why it's failing: the rescript plugin has no action called filter-seqs-by-taxonomy. In fact, there is no such action in QIIME 2. The action you are looking for is likely qiime taxa filter-seqs ....
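
For reference, a sketch of how the failing step might look with qiime taxa filter-seqs, reusing the file names from the original script. It filters the full sequence set against the full taxonomy file; the search terms are an assumption to check against the labels in your own taxonomy artifact. `run` is a hypothetical helper that prints the command when DRY_RUN=1 instead of executing it.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Keep only reference sequences whose taxonomy contains Bacteria or Archaea.
# In q2-taxa, multiple search terms are given as one comma-separated string.
filter_reference_seqs() {
  run qiime taxa filter-seqs \
    --i-sequences silva-138-99-seqs.qza \
    --i-taxonomy silva-138-99-tax.qza \
    --p-include Bacteria,Archaea \
    --o-filtered-sequences silva-138-99-16S-seqs.qza
}
```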

But yes, if you do not want to apply any other curation steps, then you can use the qiime feature-classifier fit-classifier-naive-bayes command as you have it.

-Mike

And is it okay to not use any curation steps? Will the classification be accurate?

Hi @Silke_Lambert,

The SILVA tutorial is only meant to show what is possible with RESCRIPt; it is not necessarily an SOP. Curate as you think is appropriate. :slight_smile:

That being said, at the very least, qiime rescript cull-seqs ... and qiime rescript dereplicate ... should be run prior to constructing the classifier. There are some poor quality sequences (some contain many ambiguous or missing bases) within the database, which should probably be removed, as they may interfere with classification.
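
A minimal sketch of those two steps, using the reference file names from earlier in this thread and relying on each action's default settings. The output names are placeholders, and `run` is a hypothetical helper that prints a command when DRY_RUN=1 instead of executing it.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

curate_reference() {
  # Remove low-quality sequences (for example, those with many ambiguous bases).
  run qiime rescript cull-seqs \
    --i-sequences silva-138-99-seqs.qza \
    --o-clean-sequences silva-138-99-seqs-cleaned.qza

  # Collapse identical sequences and reconcile their taxonomy labels.
  run qiime rescript dereplicate \
    --i-sequences silva-138-99-seqs-cleaned.qza \
    --i-taxa silva-138-99-tax.qza \
    --p-mode uniq \
    --o-dereplicated-sequences silva-138-99-seqs-derep.qza \
    --o-dereplicated-taxa silva-138-99-tax-derep.qza
}
```

The dereplicated sequences and taxonomy would then be the inputs to fit-classifier-naive-bayes in place of the raw files.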

-Cheers!


Hi Mike,

Thank you very much for the response. If I understand correctly, if I just download the top two files, have cull-seqs and dereplicate already been applied to them?

Kind regards,

Silke

Hi @Silke_Lambert,

You can, but those are older preprocessed SILVA files (i.e., version 132). The current SILVA version is 138, with current taxonomic labels applied. I'd recommend that you process the SILVA database yourself if possible.

If you want details on how the files were processed you can look at the provenance information of the QZA files.
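
One way to peek at that information from the command line, as a rough sketch: a .qza file is a zip archive, and the action that produced it is recorded in provenance/action/action.yaml under the archive's top-level UUID directory. The function name is made up for illustration, and `run` is a hypothetical helper that prints the command when DRY_RUN=1 instead of executing it. Loading the file at view.qiime2.org shows the full provenance graph if you prefer a graphical view.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Print the record of the action that generated the given artifact.
# The wildcard stands in for the artifact's UUID directory inside the archive.
show_provenance() {
  run unzip -p "$1" '*/provenance/action/action.yaml'
}
```

For example, show_provenance silva-138-99-seqs.qza prints the plugin, action, and parameters used in the final processing step of that file.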

Hi Mike,

So Silva 138 SSURef NR99 full-length sequences and Silva 138 SSURef NR99 full-length taxonomy are not the latest pre-formatted 138 version files?

So I need to go to the SILVA website and download the files and pre-format them myself? Using cull-seqs and dereplicate?

Kind regards,

Silke

Hi @Silke_Lambert ,

The page shown in your screenshot is from the old QIIME 2 documentation. The docs are currently in transition as we migrate to a new website, and the latest data resources are hosted here:

The latest pre-trained classifiers there (at the time of writing) were trained on SILVA 138.1. There was a more recent update, SILVA 138.2, in which they fixed some minor aspects of the taxonomy (you can find more details here). These are also compatible with the latest release of QIIME 2 (the files you referred to previously are from 2024, so not the latest).

You can use the RESCRIPt plugin to create your own pre-trained classifier from any SILVA version (no need to download and format manually). So if you want 138.2, you can use RESCRIPt.
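
A hedged sketch of what that could look like. Whether '138.2' is accepted depends on the RESCRIPt version you have installed (check qiime rescript get-silva-data --help), the output file names are placeholders, and `run` is a hypothetical helper that prints a command when DRY_RUN=1 instead of executing it.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

fetch_silva() {
  # Download and import SILVA sequences and taxonomy in one step.
  run qiime rescript get-silva-data \
    --p-version '138.2' \
    --p-target 'SSURef_NR99' \
    --o-silva-sequences silva-138.2-ssu-nr99-rna-seqs.qza \
    --o-silva-taxonomy silva-138.2-ssu-nr99-tax.qza

  # SILVA sequences arrive as RNA; convert them to DNA before curating
  # and training a classifier.
  run qiime rescript reverse-transcribe \
    --i-rna-sequences silva-138.2-ssu-nr99-rna-seqs.qza \
    --o-dna-sequences silva-138.2-ssu-nr99-seqs.qza
}
```

The resulting DNA sequences and taxonomy can then go through whatever curation steps you choose before fit-classifier-naive-bayes.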

I hope that helps clarify!


2 off-topic replies have been split into a new topic: Train PR2 Classifier

Please keep replies on-topic in the future.