I recently decided to move my pipeline to R for my own convenience. However, I couldn’t find a way in R to train a region-specific classifier (based on my primers) from the SILVA 138 database. I had already trained my own classifier for my primer set in QIIME 2 by following this tutorial. That classifier is a .qza file, whereas the assignTaxonomy() function in R expects a .gz file as its reference library. Is there any way to convert my silva138-classifier-341f-805r.qza to silva138-classifier-341f-805r.gz?
Much appreciated in advance.
Perhaps because this method and workflow are quite specific to QIIME 2. There are other taxonomy classifiers available in R, but they have their own workflows and input formats.
No. That file does not contain a gzipped set of DNA sequences; it contains a trained scikit-learn classifier, which nothing in R can read. So there is no way to export that classifier and use it in R.
note: I edited the title to make it more specific to your question. QZA is a vague extension (just as gz can contain any gzipped contents, a QZA can contain any QIIME 2 results).
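To see for yourself what a given QZA holds, you can list its contents: a QIIME 2 artifact is an ordinary zip archive. The snippet below is only a minimal sketch. The function name `list_qza_members` is made up for illustration, and the file name in the usage comment is simply the one from the question.

```python
import zipfile


def list_qza_members(qza):
    """Return the names of the files stored inside a QIIME 2 artifact.

    `qza` may be a path or an open binary file object. The artifact is
    only opened as a zip archive; nothing is extracted or loaded.
    """
    with zipfile.ZipFile(qza) as archive:
        return archive.namelist()


# Hypothetical usage (file name taken from the question above):
# for name in list_qza_members("silva138-classifier-341f-805r.qza"):
#     print(name)
```

Listing the members only shows what is stored inside the archive. It does not make a trained classifier usable in R.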
I have already downloaded the silva138 classifier, but it covers the whole gene, while I want one that is specific to the region amplified by my primer set. Since I couldn’t find a workflow to make the classifier region-specific, do you think it would be all right to just go with the whole-gene classifier?
Are you asking rather “can RESCRIPt be used to create a classifier that can be exported and used in R?”
You could use RESCRIPt (following that tutorial) to compile and format a custom reference sequence database, then export those formatted sequences (prior to training the classifier) to fasta format and use them in R (e.g., for taxonomy classification). However, you cannot export a trained classifier and use it in R because it is in a very special format.
Hi Mike,
Yes, I did this part and trained my classifier. Now I would like to use this classifier, which I trained specifically for my primers, in my R workflow. Since I couldn’t find a way to train an amplicon-specific classifier in R, I was wondering whether there is any way to convert my QIIME-trained classifier for use in R.
Okay, then you can proceed as @Nicholas_Bokulich suggested above. Just take the formatted taxonomy and sequence files (the ones you’d input into the classifier) and import them into R instead. Then use your favorite R tools to train your reference database and classify. For example, you can likely use the approach from this pipeline.
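As a rough illustration of that hand-off, here is a minimal Python sketch that rewrites an exported FASTA file so each header carries the taxonomy instead of the sequence ID, in the `>Level1;Level2;...;` style that assignTaxonomy() reads. It rests on assumptions: the taxonomy is a two-column, tab-separated file with a header row, ranks are separated by `;` and carry prefixes such as `d__`, and only the first six ranks are kept. The function and file names are hypothetical.

```python
import csv


def read_taxonomy(tsv_path):
    """Map feature ID -> taxon string from a two-column, tab-separated file."""
    taxonomy = {}
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row (assumed: Feature ID, Taxon)
        for row in reader:
            if len(row) >= 2:
                taxonomy[row[0]] = row[1]
    return taxonomy


def to_dada2_header(taxon, max_ranks=6):
    """Turn 'd__Bacteria; p__Firmicutes; ...' into 'Bacteria;Firmicutes;...;'."""
    names = []
    for rank in taxon.split(";")[:max_ranks]:
        name = rank.strip()
        if "__" in name:
            name = name.split("__", 1)[1]  # drop the rank prefix, e.g. 'd__'
        if not name:
            break  # stop at the first unannotated rank
        names.append(name)
    return ";".join(names) + ";"


def convert(fasta_in, taxonomy_tsv, fasta_out):
    """Write a taxonomy-labelled FASTA; return the number of records written.

    Sequences whose ID has no entry in the taxonomy file are skipped.
    """
    taxonomy = read_taxonomy(taxonomy_tsv)
    written = 0
    keep = False
    with open(fasta_in) as src, open(fasta_out, "w") as dst:
        for line in src:
            if line.startswith(">"):
                fields = line[1:].split()
                seq_id = fields[0] if fields else ""
                keep = seq_id in taxonomy
                if keep:
                    dst.write(">" + to_dada2_header(taxonomy[seq_id]) + "\n")
                    written += 1
            elif keep:
                dst.write(line)
    return written
```

A call such as `convert("trimmed-seqs.fasta", "trimmed-taxa.tsv", "region-specific.fa")` (hypothetical file names) would use the sequences and the taxonomy from the same dereplication step, so the two files stay in sync.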
I used the formatted taxonomy and sequence files as input for the dada2:::makeTaxonomyFasta_SilvaNR() function in R and got a region-specific classifier based on my primers. However, when I used it to assign taxonomy, I got many more NAs at various taxonomic levels than with the non-region-specific classifier (silva_nr99_v138_wSpecies_train_set.fa.gz).
This picture shows the results of the region-specific classifier
Is there a reason why you link to the fasta file from the SILVA web site? The name you provide does not match the file name in the link. Was this intended?
I guess I’d need more details on how the reference data is handled and/or further processed within R prior to classification. I am unfamiliar with how this tool and its commands (e.g. makeTaxonomyFasta_SilvaNR) work. So, I’d suggest classifying through QIIME 2 for a comparison and sanity check.
I assume you followed all of the “Make amplicon-region specific classifier” parts of the RESCRIPt tutorial? That is, the sequence and taxonomy dereplication steps, prior to importing them into R? Just asking to make sure I understand all the steps you’ve taken.
Following your tutorial, I made two artifacts: the dereplicated sequences and the taxonomy, both trimmed to the 341f/805r primer set (ready to be used for training my classifier). Then I converted them to a fasta file (the sequences) and a tsv file (the taxonomy) with the QIIME exporter.
Then I tried to use them as the inputs for "dada2:::makeTaxonomyFasta_SilvaNR(trimmed-seqs.fasta, full-taxa.txt, output=classifier.gz)". This function uses the naive Bayes method for training. However, when reading the taxa.txt file, it kept giving me a format error. When I compared that file with the standard (full-length) one from the silva138 website, I realized that the taxonomy output of QIIME has only two columns, Feature ID and Taxon:
To get around this error, I used only my trimmed (primer-based) sequences together with the full taxonomy file instead of the trimmed one. It ran successfully, but the results were different from those I got with the silva138 full-length classifier. Could that be the reason?
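One way to check whether swapping the taxonomy files could matter is to compare the labels the two files give to the same feature IDs. The following is a minimal sketch, assuming both files are two-column, tab-separated tables with a header row. Whether any labels actually differ will depend on how the dereplication was run. The function and file names are hypothetical.

```python
import csv


def read_taxonomy(tsv_path):
    """Map feature ID -> taxon string from a two-column, tab-separated file."""
    taxonomy = {}
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row
        for row in reader:
            if len(row) >= 2:
                taxonomy[row[0]] = row[1].strip()
    return taxonomy


def compare_taxonomies(derep_tsv, full_tsv):
    """Summarise how two taxonomy tables label the feature IDs they share."""
    derep = read_taxonomy(derep_tsv)
    full = read_taxonomy(full_tsv)
    shared = derep.keys() & full.keys()
    return {
        # number of IDs present in both tables
        "shared": len(shared),
        # IDs in the dereplicated table that the full table lacks
        "missing_from_full": sorted(derep.keys() - full.keys()),
        # shared IDs whose taxon strings are not identical
        "differing": sorted(i for i in shared if derep[i] != full[i]),
    }


# Hypothetical usage:
# report = compare_taxonomies("trimmed-taxa.tsv", "full-taxa.tsv")
# print(report["shared"], len(report["differing"]))
```

If `differing` comes back empty, the taxonomy swap is probably not what changes the results, and the comparison through QIIME 2 suggested above would be the next thing to try.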