how to convert a trained taxonomy classifier to a gz file?

farhad1990 · January 11, 2021, 12:02pm

Hi everyone,

I recently decided to move my pipeline to R for my own convenience. However, I couldn't find a way to train a region-specific classifier (based on my primers) from the SILVA 138 database in R. I had already trained my own classifier for my primer set in QIIME 2 by following this tutorial; its format is .qza, while the assignTaxonomy() function in R expects a .gz file as the reference library. Is there any way to convert my silva138-classifier-341f-805r.qza to silva138-classifier-341f-805r.gz?
Much appreciated in advance.

Nicholas_Bokulich · January 11, 2021, 12:28pm

Perhaps because this method and workflow are quite specific to QIIME 2. There are other taxonomy classifiers available in R, but they have their own workflows and input formats.

No. That file does not contain a gzipped set of DNA sequences; it contains a trained scikit-learn classifier, which would be unreadable by anything in R. So there is no way to export it and use that classifier in R.

note: I edited the title to make it more specific to your question. QZA is a vague extension (just as gz can contain any gzipped contents, a QZA can contain any QIIME 2 results).
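
To see what a given QZA actually holds, you can look inside it without QIIME 2, since a .qza is a zip archive. Here is a minimal Python sketch, assuming the usual artifact layout of `<uuid>/data/...`; the function name `list_artifact_data` is made up for illustration.

```python
import zipfile


def list_artifact_data(qza_path):
    """Return the files stored under <uuid>/data/ in a QIIME 2 artifact."""
    with zipfile.ZipFile(qza_path) as archive:
        names = archive.namelist()

    data_files = []
    for name in names:
        parts = name.split("/")
        # Assumed layout: <uuid>/data/<file>; skip directory entries.
        if len(parts) >= 3 and parts[1] == "data" and parts[-1]:
            data_files.append("/".join(parts[2:]))
    return sorted(data_files)
```

Run on a trained classifier, this should list a serialized scikit-learn pipeline rather than a FASTA file, which is why there is nothing in it for assignTaxonomy() to read.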

Good luck!

farhad1990 · January 11, 2021, 2:30pm

Thanks Nicholas,

I already downloaded the silva138 classifier, but it is for the whole gene, while I want it to be region-specific to my primer set. Since I couldn't find a workflow to make the classifier region-specific to my primers, do you think it would be all right if I just go with the full-length classifier?

Kinds

SoilRotifer · January 11, 2021, 3:15pm

Hi @farhad1990,

Earlier in this thread you said you worked through the RESCRIPt tutorial. Did you not try this part of the tutorial?

-Mike

Nicholas_Bokulich · January 11, 2021, 3:26pm

Ah thanks for clarifying @SoilRotifer — I think I understand your question now @farhad1990

Are you asking rather "can RESCRIPt be used to create a classifier that can be exported and used in R?"

You could use RESCRIPt (following that tutorial) to compile and format a custom reference sequence database, then export those formatted sequences (prior to training the classifier) to fasta format and use them in R (e.g., for taxonomy classification). However, you cannot export a trained classifier and use it in R, because it is a trained scikit-learn classifier rather than a sequence file.

farhad1990 · January 11, 2021, 4:15pm

Hi Mike,
Yes, I did this part and trained my classifier. However, I would now like to use this classifier, which I trained specifically for my primers, in my R workflow. Since I couldn't find a way in R to train an amplicon-specific classifier, I was wondering if there is any way to use my QIIME 2-trained classifier in R.

Kinds,
Farhad

SoilRotifer · January 11, 2021, 5:06pm

Okay, then you can proceed as @Nicholas_Bokulich suggested above. Just take the formatted taxonomy and sequence files (the ones you'd input into the classifier) and import them into R instead. Then use your favorite R tools to train your reference database and classify. For example, you can likely use the approach from this pipeline.
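
As a rough illustration of that hand-off, here is a minimal Python sketch that joins an exported FASTA file with the exported two-column taxonomy table (Feature ID, Taxon) and writes a gzipped FASTA whose headers carry the lineage as `>Level1;Level2;...;`, which is the layout DADA2 documents for assignTaxonomy() references. It assumes RESCRIPt-style rank prefixes such as `d__` and `p__`; the function names and file names are hypothetical, and whether to keep the species rank is left to you.

```python
import csv
import gzip
import re

# Matches rank prefixes such as "d__" or "p__" (assumed RESCRIPt-style labels).
RANK_PREFIX = re.compile(r"^[a-z]+__")


def read_taxonomy(tsv_path):
    """Map each feature ID to a list of rank names, dropping empty ranks."""
    lineages = {}
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the "Feature ID / Taxon" header row
        for row in reader:
            if len(row) < 2:
                continue
            ranks = [RANK_PREFIX.sub("", part.strip()) for part in row[1].split(";")]
            lineages[row[0]] = [rank for rank in ranks if rank]
    return lineages


def read_fasta(fasta_path):
    """Yield (sequence ID, sequence) pairs; multi-line records are joined."""
    seq_id, chunks = None, []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if seq_id is not None:
                    yield seq_id, "".join(chunks)
                fields = line[1:].split()
                seq_id = fields[0] if fields else ""
                chunks = []
            else:
                chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def write_dada2_reference(fasta_path, tsv_path, out_path):
    """Write a gzipped FASTA with lineage headers; return the record count."""
    lineages = read_taxonomy(tsv_path)
    written = 0
    with gzip.open(out_path, "wt") as out:
        for seq_id, sequence in read_fasta(fasta_path):
            ranks = lineages.get(seq_id)
            if not ranks:
                continue  # no taxonomy for this sequence, so skip it
            out.write(">" + ";".join(ranks) + ";\n" + sequence + "\n")
            written += 1
    return written
```

The resulting file (say, a hypothetical silva138-341f-805r.fa.gz) could then be passed as the reference argument of assignTaxonomy() in R. Sequences without a taxonomy entry are skipped rather than written with an empty header.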

farhad1990 · January 11, 2021, 5:10pm

Thanks Mike,
I will go for it

Kinds,
Farhad

farhad1990 · January 11, 2021, 7:35pm

That was the exact question I asked and thanks for the answer. I am currently working on it.

farhad1990 · January 18, 2021, 9:17pm

Hi Mike,

I have used the formatted taxonomy and sequence files as input for the dada2:::makeTaxonomyFasta_SilvaNR() function in R and got a region-specific classifier based on my primers. However, when I used it to assign taxonomy, I got many more NAs at various taxonomic levels than with the non-region-specific classifier (silva_nr99_v138_wSpecies_train_set.fa.gz).
This picture shows the results for the region-specific classifier

and this one shows the results for the full-length classifier

Is this expected, or is something wrong?
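
To put numbers on the difference instead of comparing screenshots, the share of unassigned entries per rank can be tabulated for each result. Below is a minimal Python sketch, assuming each assignment table has been written out from R as rows of rank values with missing entries stored as NA; `na_fraction_per_rank` and the rank names are illustrative only.

```python
def na_fraction_per_rank(assignments, ranks):
    """Return {rank: fraction of rows with no assignment at that rank}."""
    total = len(assignments)
    fractions = {}
    for index, rank in enumerate(ranks):
        missing = sum(
            1
            for row in assignments
            # A row that is too short counts as unassigned at this rank.
            if index >= len(row) or row[index] in (None, "", "NA")
        )
        fractions[rank] = missing / total if total else 0.0
    return fractions
```

Running it once on each table and comparing the two dictionaries shows at which rank the two references start to diverge.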

Kinds,
Farhad

SoilRotifer · January 18, 2021, 11:17pm

Difficult to say.

Is there a reason why you link to the fasta file from the SILVA web site? The name you provide does not match the file name in the link. Was this intended?

I guess I'd need more details on how the reference data is handled and/or further processed within R prior to classification. I am unfamiliar with how this tool and its commands (e.g. makeTaxonomyFasta_SilvaNR) work. So, I'd suggest classifying through QIIME 2 for a comparison and sanity-check.

I assume you followed all of the "Make amplicon-region specific classifier" parts of the RESCRIPt tutorial? That is, the sequence and taxonomy dereplication steps, prior to importing them into R? Just asking to make sure I understand all the steps you've taken.

Check out this thread: training classifiers: performance of full-length vs. extract-reads for some additional insights.

Do you have anything to add, @Nicholas_Bokulich ?

-Mike

TurboQiimer · January 19, 2021, 1:12am

Hi,
Regarding the photo you shared, can I ask what environment it is?
Thanks
Qiimer

farhad1990 · January 19, 2021, 3:40pm

Hi Mike,

Following your tutorial, I made two artifacts: dereplicated sequences and taxonomy, trimmed for the 341f and 805r primer set (ready to be used for training my classifier). Then I converted them to fasta (the sequences) and tsv (the taxonomy) files with the QIIME 2 export tool.
Then I tried to use them as inputs for "dada2:::makeTaxonomyFasta_SilvaNR(trimmed-seqs.fasta, full-taxa.txt, output=classifier.gz)". This function uses the naive Bayes method for training. However, when reading the taxa.txt file, it kept giving me a format error. When I compared it with the standard (full-length) one from the SILVA 138 website, I realized that the taxonomy output of QIIME 2 has only two columns, Feature ID and Taxon:

It seems completely different from the full taxa file I got from the SILVA 138 database:

To work around this error, I used my trimmed (primer-based) sequences together with the full taxa file instead of the trimmed one. It ran successfully, but the results were different from those I got with the SILVA 138 full-length classifier. Could that be the reason?
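
One sanity check that may help narrow this down for the QIIME 2 exported pair is to confirm that the trimmed sequences and the two-column taxonomy table refer to the same IDs. This is a minimal Python sketch, assuming a FASTA file whose headers start with the feature ID and a tab-separated table with a header row; the function names are hypothetical.

```python
def fasta_ids(fasta_path):
    """Collect the first whitespace-delimited token of every FASTA header."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def taxonomy_ids(tsv_path):
    """Collect the first column of a tab-separated table, skipping the header."""
    ids = set()
    with open(tsv_path) as handle:
        next(handle, None)  # skip the "Feature ID / Taxon" header row
        for line in handle:
            first = line.rstrip("\n").split("\t")[0]
            if first:
                ids.add(first)
    return ids


def compare_ids(fasta_path, tsv_path):
    """Report IDs present in one file but absent from the other."""
    seqs, taxa = fasta_ids(fasta_path), taxonomy_ids(tsv_path)
    return {"missing_taxonomy": seqs - taxa, "missing_sequence": taxa - seqs}
```

If both sets come back empty, the exported pair is at least consistent, and the mismatch is more likely to lie in how the R function expects its inputs to be formatted.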

farhad1990 · January 19, 2021, 3:46pm

Hi,

It is RStudio in a Jupyter notebook.

Kinds,
Farhad

system · February 20, 2021, 2:17am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.