I am new to working with taxonomic data. I denoised my data and now want to train my classifier using the SILVA database and assign taxonomy to my data. I have separate files for my 16S and 18S data.
I was wondering whether I should first filter the SILVA database by Bacteria and Eukaryota, or whether it is better to filter based on the primers I used.
I am currently trying to run the following script for my 16S data, but it does not work:
I'd suggest not removing eukaryote / 18S sequences from the reference database. These act as good decoy / outgroup reference reads for the classifier.
The reason for this is that many of the standard 16S primers can, and often do, amplify 18S (and other off target) sequences. Thus, it is good to keep these in so that you can remove any reads that are identified as such. Otherwise you might erroneously classify reads as Bacteria / Archaea, when they are in fact Eukaryotes.
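As a rough sketch of that clean-up step: once taxonomy has been assigned, features classified as Eukaryota can be dropped from the feature table with `qiime taxa filter-table`. The function name and the file names you pass in are placeholders, not files from this thread.

```shell
# Sketch: remove features classified as Eukaryota after taxonomy assignment.
# $1 = feature table, $2 = taxonomy from the classifier, $3 = output table.
# All three are placeholder artifacts; substitute your own.
remove_off_target() {
  # --p-exclude takes a comma-separated list of search terms.
  qiime taxa filter-table \
    --i-table "$1" \
    --i-taxonomy "$2" \
    --p-exclude "Eukaryota" \
    --o-filtered-table "$3"
}
```

In an activated QIIME 2 environment this would be called as, for example, `remove_off_target table.qza taxonomy.qza table-no-euk.qza`.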
As @colinbrislawn mentioned, please provide information about the types of errors you are running into.
Error: QIIME 2 plugin 'rescript' has no action 'filter-seqs-by-taxonomy'. Did you mean 'filter-seqs-length-by-taxon'?
"I'd suggest not removing eukaryote / 18S sequences from the reference database. These act as good decoy / outgroup reference reads for the classifier." So do I still need to filter, or can I just build a classifier as follows:
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads silva-138-99-seqs.qza \
--i-reference-taxonomy silva-138-99-tax.qza \
--o-classifier classifier-silva-138-99.qza
The error message is telling you exactly why it's failing:
In fact, there is no action called filter-seqs-by-taxonomy in QIIME 2. The action you are looking for is likely qiime taxa filter-seqs ...
But yes, if you do not want to apply any other curation steps, then you can use the qiime feature-classifier fit-classifier-naive-bayes command as you have it.
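For reference, if you do decide to subset the reference by taxonomy, a `qiime taxa filter-seqs` call might look roughly like the sketch below. The function name, the search term and the file names are illustrative assumptions only.

```shell
# Sketch: keep only reference sequences whose taxonomy matches a search term.
# $1 = reference sequences, $2 = reference taxonomy,
# $3 = search term to keep, $4 = output sequences (placeholder name).
keep_taxon() {
  qiime taxa filter-seqs \
    --i-sequences "$1" \
    --i-taxonomy "$2" \
    --p-include "$3" \
    --o-filtered-sequences "$4"
}
```

For example, `keep_taxon silva-138-99-seqs.qza silva-138-99-tax.qza Bacteria bacteria-seqs.qza`. Keep in mind that removing the eukaryote sequences this way also removes the decoy effect described earlier in the thread.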
The SILVA tutorial is only meant to show what is possible with RESCRIPt, it's not necessarily an SOP. Curate as you think is appropriate.
That being said, at the very least, qiime rescript cull-seqs ... and qiime rescript dereplicate ... should be run prior to constructing the classifier. There are some poor quality sequences (some contain many ambiguous or missing bases) within the database, which should probably be removed, as they may interfere with classification.
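A minimal sketch of those two steps, assuming default settings for cull-seqs and the `uniq` dereplication mode. The function name and output file names are placeholders, and the defaults may not suit every dataset.

```shell
# Sketch: minimal reference curation before training a classifier.
# $1 = reference sequences, $2 = reference taxonomy.
# Output names (seqs-culled.qza, seqs-derep.qza, tax-derep.qza) are placeholders.
curate_reference() {
  # Remove sequences with too many ambiguous bases or long homopolymers
  # (default thresholds).
  qiime rescript cull-seqs \
    --i-sequences "$1" \
    --o-clean-sequences seqs-culled.qza || return 1

  # Collapse identical sequences, keeping distinct taxonomy labels.
  qiime rescript dereplicate \
    --i-sequences seqs-culled.qza \
    --i-taxa "$2" \
    --p-mode uniq \
    --o-dereplicated-sequences seqs-derep.qza \
    --o-dereplicated-taxa tax-derep.qza
}
```

Calling `curate_reference silva-138-99-seqs.qza silva-138-99-tax.qza` would write `seqs-derep.qza` and `tax-derep.qza`, which can then be given to `fit-classifier-naive-bayes` as `--i-reference-reads` and `--i-reference-taxonomy`.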
Thank you very much for the response. If I understand correctly, if I just download the top two files, cull-seqs and dereplicate have already been applied?
You can... but those are older preprocessed SILVA files (i.e., version 132). The current SILVA version is 138, with current taxonomic labels applied. I'd recommend that you process the SILVA database yourself if possible.
If you want details on how the files were processed you can look at the provenance information of the QZA files.
The page shown in your screenshot is from the old QIIME 2 documentation. The docs are currently in transition as we migrate to a new website, and the latest data resources are hosted here:
The latest pre-trained classifiers there (at the time of writing) were trained on SILVA 138.1. There was a more recent update, SILVA 138.2, which fixed some minor aspects of the taxonomy (you can find more details here). These classifiers are also compatible with the latest release of QIIME 2 (the files you referred to previously are from 2024, so not the latest).
You can use the RESCRIPt plugin to create your own pre-trained classifier from any SILVA version (no need to download and format manually). So if you want 138.2, you can use RESCRIPt.
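Roughly, the download step could look like the sketch below. The function name and output file names are placeholders, and which version strings are accepted depends on the RESCRIPt release you have installed (check `qiime rescript get-silva-data --help`).

```shell
# Sketch: fetch a SILVA release through RESCRIPt and convert it to DNA.
# $1 = SILVA version string, e.g. 138.2 (must be supported by your RESCRIPt release).
# Output file names are placeholders derived from the version.
fetch_silva() {
  qiime rescript get-silva-data \
    --p-version "$1" \
    --p-target SSURef_NR99 \
    --o-silva-sequences "silva-$1-rna-seqs.qza" \
    --o-silva-taxonomy "silva-$1-tax.qza" || return 1

  # SILVA is distributed as RNA sequences; convert to DNA before curation or training.
  qiime rescript reverse-transcribe \
    --i-rna-sequences "silva-$1-rna-seqs.qza" \
    --o-dna-sequences "silva-$1-seqs.qza"
}
```

After `fetch_silva 138.2`, the resulting sequences and taxonomy can go through the same cull-seqs and dereplicate curation mentioned earlier in the thread before training the classifier.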