Hello,
I'm using the taxonomy classifiers in QIIME 2 for the first time, to assign taxonomy to my fungal (18S) and bacterial (16S) MiSeq sequences. For now, I'd like to use the SILVA database. I've worked with metabarcoding before, but training a classifier is new to me.
If I get it right, I have two options:
- Using the pre-trained "Silva 138 99% OTUs full-length sequences" classifier, found at the top of: Data resources — QIIME 2 2022.2.0 documentation
This classifier is not trained on my specific primers. So, to build this classifier, were all SILVA reference sequences used for training, rather than first extracting the region amplified by specific primers (in contrast to option 2 below)? Can this file be used to classify any 16S/18S sequences? Are there any downsides to using this classifier for taxonomic assignment of fungal/bacterial 18S/16S sequences, apart from it not being focused on my region of interest as in option 2 below?
I've searched Google and the forum for more information on training a classifier, but I still do not fully understand it. I know that using a trained classifier improves performance, but why exactly is that? Why not 'just' use an untrained reference database such as the SILVA database?
- I can extract the reads from the SILVA database with my particular primers and then train the classifier:
For that, I would first download "Silva 138 SSURef NR99 full-length sequences" and "Silva 138 SSURef NR99 full-length taxonomy" from Data resources — QIIME 2 2022.2.0 documentation, under "Marker Gene reference Databases".
Do I understand correctly that the downloads under "Marker Gene reference Databases" are "raw" databases from which you can extract your own reads?
Then I would extract my region of interest:
qiime feature-classifier extract-reads \
--i-sequences silva-138-99-seqs.qza \
--p-f-primer GTGCCAGCMGCCGCGGTAA \
--p-r-primer GGACTACHVGGGTWTCTAAT \
--p-identity 0.8 \
--p-min-length 175 \
--p-max-length 500 \
--o-reads silva-138-99_ref-seqs_extracted.qza
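To check my own understanding of the degenerate bases in these primers (M, H, V, W), here is a minimal shell sketch of exact primer matching on a toy sequence. It is only an illustration: the function names and the toy sequence are made up by me, and it is not how extract-reads works internally. In particular, as far as I understand, extract-reads tolerates mismatches (via --p-identity), which this sketch does not.

```shell
# Expand IUPAC degenerate codes into character classes (uppercase input only).
# Replacements contain only A/C/G/T, so the order of substitutions is irrelevant.
iupac_to_regex() {
  printf '%s\n' "$1" | sed \
    -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
    -e 's/K/[GT]/g' -e 's/M/[AC]/g' -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
    -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Reverse-complement a primer, keeping degenerate codes (e.g. H <-> D, V <-> B).
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTRYSWKMBDHVN' 'TGCAYRSWMKVHDBN'
}

fwd_re=$(iupac_to_regex GTGCCAGCMGCCGCGGTAA)
# The reverse primer binds the opposite strand, so on the reference it appears
# as its reverse complement.
rev_re=$(iupac_to_regex "$(revcomp GGACTACHVGGGTWTCTAAT)")

# Toy reference (hypothetical): flank + forward primer + insert + revcomp(reverse primer) + flank
toy="AAAGTGCCAGCAGCCGCGGTAATACGGATTAGATACCCTGGTAGTCCTTT"

# Keep only what lies between the two primer sites.
insert=$(printf '%s\n' "$toy" | sed -E "s/.*${fwd_re}(.*)${rev_re}.*/\1/")
echo "forward pattern: $fwd_re"
echo "reverse pattern: $rev_re"
echo "extracted insert: $insert"
```

If I read the help text correctly, --p-min-length and --p-max-length would then filter the extracted inserts by length, which the toy example skips.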
and then train the classifier:
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads silva-138-99_ref-seqs_extracted.qza \
--i-reference-taxonomy silva-138-99-tax.qza \
--o-classifier classifier.qza
Now I have a fully functional classifier trained on my specific region of interest, which I can use for taxonomic assignment of my reads.
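For that last step, I assume I would run something like the sketch below. rep-seqs.qza and taxonomy.qza are placeholder names for my representative sequences and the output file, and the guard only skips the command when QIIME 2 or the input files are not available.

```shell
CLASSIFIER=classifier.qza   # output of fit-classifier-naive-bayes above
REP_SEQS=rep-seqs.qza       # placeholder: my representative sequences
TAXONOMY=taxonomy.qza       # placeholder: output file name

status=skipped
# Only run when QIIME 2 is on the PATH and both input artifacts exist.
if command -v qiime >/dev/null 2>&1 && [ -f "$CLASSIFIER" ] && [ -f "$REP_SEQS" ]; then
  qiime feature-classifier classify-sklearn \
    --i-classifier "$CLASSIFIER" \
    --i-reads "$REP_SEQS" \
    --o-classification "$TAXONOMY" && status=classified
fi
echo "classification step: $status"
```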
Is this the correct approach?
Thank you very much!