Rep.Seqs. qza fungal file/data to train the classifier

Fabs · June 25, 2018, 3:39am

Hello everyone,

I am working with fungal data using ITS1F/ITS2 and I am trying to train a feature classifier. I have both of the files needed to begin the training, the reference sequences and the corresponding taxonomic classifications, but I am not sure where I can get a rep.seqs.qza or equivalent file to train the classifier.

If anyone has any ideas I would really appreciate it.

Nicholas_Bokulich · June 25, 2018, 11:57pm

Hi @Fabs
Have you seen this tutorial?

That contains all the steps you need, including importing the necessary files.

Let us know if that's what you're looking for! Good luck!

Fabs · June 26, 2018, 3:19am

Hi Nicholas,

I am actually following that tutorial, but the third file it uses was created in tutorial 1 (Moving Pictures), where we already know what each sequence belongs to. In my case, I need a file containing fungal sequences that correspond to the ITS1/ITS2 regions, so that I can train the machine on data similar to what I will be working with. The problem is that I do not know where to get the data to train the machine on.

This is the file required:

Mehrbod_Estaki · June 26, 2018, 6:35am

Hi @Fabs,
Your original question asked where you could get rep.seqs.qza, while the image above highlights ref-seqs.qza. I just want to clarify the difference in case this is what is causing the confusion.
You will need to import two files from your reference database. Let's say that for ITS you are using UNITE: you will need the actual reference reads (in the tutorial this file is called 85_otus.fasta) and the taxonomies (in the tutorial, 85_otu_taxonomy.txt). You can obtain these from this link, which is also provided at the bottom of the tutorial.
In the tutorial we extract the reads matching our specific primers and length from the 85_otus.fasta file. The result becomes our ref-seqs.qza, and you train the classifier on it. Please note, though, the recommendations at the very bottom of the tutorial, which suggest that when training on UNITE for ITS classification you avoid extracting/trimming.

The representative sequences (rep-seqs.qza) are a separate artifact: a list of representative sequences obtained from your actual data. You get this artifact when you denoise your original data; it is part of the output from DADA2 or Deblur.
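
To make the two roles concrete, here is a rough sketch of the order of operations. It is a small Python helper that only assembles the QIIME 2 command lines and prints them; it does not run anything. The file names `unite_refs.fasta` and `unite_taxonomy.txt` are placeholders for whatever you download from UNITE, and the function name is made up for illustration. The `--input-format` flag is what recent QIIME 2 releases use; older releases named it differently, so check `qiime tools import --help` for your version.

```python
import shlex


def classifier_workflow(ref_fasta, ref_taxonomy, rep_seqs="rep-seqs.qza"):
    """Return the QIIME 2 CLI calls, in order, as argument lists.

    Hypothetical helper: nothing is executed here.
    """
    # 1. Import the reference sequences (known database, e.g. UNITE).
    import_seqs = [
        "qiime", "tools", "import",
        "--type", "FeatureData[Sequence]",
        "--input-path", ref_fasta,
        "--output-path", "ref-seqs.qza",
    ]
    # 2. Import the matching reference taxonomy (assumed headerless TSV).
    import_taxonomy = [
        "qiime", "tools", "import",
        "--type", "FeatureData[Taxonomy]",
        "--input-format", "HeaderlessTSVTaxonomyFormat",
        "--input-path", ref_taxonomy,
        "--output-path", "ref-taxonomy.qza",
    ]
    # 3. Train on the reference data only (no primer extraction for UNITE/ITS).
    train = [
        "qiime", "feature-classifier", "fit-classifier-naive-bayes",
        "--i-reference-reads", "ref-seqs.qza",
        "--i-reference-taxonomy", "ref-taxonomy.qza",
        "--o-classifier", "classifier.qza",
    ]
    # 4. Only now do your own denoised sequences appear, as the query.
    classify = [
        "qiime", "feature-classifier", "classify-sklearn",
        "--i-classifier", "classifier.qza",
        "--i-reads", rep_seqs,
        "--o-classification", "taxonomy.qza",
    ]
    return [import_seqs, import_taxonomy, train, classify]


if __name__ == "__main__":
    # Placeholder file names; substitute your own UNITE download.
    for cmd in classifier_workflow("unite_refs.fasta", "unite_taxonomy.txt"):
        print(shlex.join(cmd))
```

Note that rep-seqs.qza shows up only in the last step: it is what gets classified, not what the classifier is trained on.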

Hope that helps!

Fabs · June 28, 2018, 3:36am

Hi Mehrbod,

Sorry about the confusion. I meant the rep-seqs.qza. I did notice that it is based on tutorial 1, but I thought that, since I am training the machine to understand my data, I would train it on data that has already been properly identified. That way I could compare the training results with the actual results for that data, which would show me that QIIME is properly identifying the fungi. Or is this not the case?

Nicholas_Bokulich · June 28, 2018, 12:02pm

short answer: The classifier is only trained on the reference sequences and taxonomy, as shown in the tutorial. Training is never performed on the representative sequences (e.g., ASVs) from your own dataset, since you do not actually know the IDs for any of those sequences (even after you classify them, they are still just predictions!)

See the moving pictures tutorial or overview tutorial to better understand where the query samples come from in the context of an entire experiment.

long answer: sounds like maybe you are confusing feature classification with sample classification. That's fair — they are both very similar processes! But with different inputs and goals. In both cases you need to train your classifier on known data. We can train a sample classifier (e.g., for predicting sample metadata based on microbial composition) on our own dataset because we (presumably) know the classes (e.g., metadata values). Feature classifiers (for predicting microbial identity based on sequence composition) must similarly be trained on known data. That is the reference sequence data, which you already have — it is never trained on representative sequences (e.g., ASVs) from your dataset. Those representative sequences are instead used as query sequences, i.e., the unknown features that you are attempting to classify.
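
If it helps to see the split between training data and query data in miniature, here is a toy sketch in plain Python. It is a stand-in for illustration only and is not the q2-feature-classifier implementation: the sequences, the taxon labels, the k-mer length, and the function names are all invented. The point is only that `train` sees labelled reference sequences, while `classify` sees an unlabelled query.

```python
from collections import Counter, defaultdict
import math


def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(reference, k=3):
    """Count k-mers per taxon from labelled (sequence, taxon) pairs."""
    counts = defaultdict(Counter)
    for seq, taxon in reference:
        counts[taxon].update(kmers(seq, k))
    return counts


def classify(counts, query, k=3):
    """Return the taxon whose k-mer counts give the query the highest
    log-likelihood (add-one smoothing, equal priors for every taxon)."""
    vocab = {km for c in counts.values() for km in c}
    best_taxon, best_score = None, -math.inf
    for taxon, c in counts.items():
        total = sum(c.values()) + len(vocab)
        score = sum(math.log((c[km] + 1) / total) for km in kmers(query, k))
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon


if __name__ == "__main__":
    # Known data: plays the role of the reference sequences plus taxonomy.
    reference = [
        ("ACGTACGTACGT", "Taxon_A"),
        ("TTGGCCTTGGCC", "Taxon_B"),
    ]
    model = train(reference)

    # Unknown data: plays the role of the representative sequences.
    for query in ["ACGTACGA", "TTGGCCAA"]:
        print(query, "->", classify(model, query))
```

The queries never enter `train`; their labels are predictions made from the reference data, which is the same relationship your rep-seqs have to the trained classifier.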

I hope that helps clarify!

Fabs · June 28, 2018, 11:44pm

I see, so I would train the feature classifier using the .fasta and .txt files downloaded from UNITE, and the representative sequences will be my denoised data (rep-seqs.qza), correct?

By any chance, is there a manual and/or video that explains feature classifier training in more depth? I want to make sure that I properly understand what I am doing.

Nicholas_Bokulich · June 28, 2018, 11:52pm

correct 100%

feature classifier tutorial
overview tutorial
this paper gives some technical details
