Training a classifier

Mehrdad · April 3, 2019, 10:42pm

Hi,

I have a question about training a classifier.

As we know, training a classifier requires two main components:
the reference sequences and the corresponding taxonomic classifications.
https://docs.qiime2.org/2019.1/tutorials/feature-classifier/

Before training, the files have to be imported as follows:

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 85_otus.fasta \
  --output-path 85_otus.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path 85_otu_taxonomy.txt \
  --output-path ref-taxonomy.qza

I'd like to work with the SILVA database. I downloaded the SILVA version dated 10-Apr-2018 15:39:14, shown here:

[Screenshot from 2019-04-03 14-10-20]

There are six directories, and their contents look the same.

In the taxonomy folder there are three directories, for 16S, 18S and all taxonomy. Their contents also look the same and, surprisingly, they are all the same size, 182.6 GB.
[Screenshot from 2019-04-03 14-12-50]

First, I do not know what the differences between them are. Second, I do not know which of the 99% similarity directories is the right one for training a classifier.

And finally, I did not see any file in .fasta format in any of the directories to use with this import command:

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 85_otus.fasta \
  --output-path 85_otus.qza

I looked at the thread that recommended using the rep-set file rather than the aligned sequences. But the file inside the rep-set directory is in .fna format.

I also looked at this well-explained thread:

but it focuses on the Greengenes database, whereas the database I intend to use is SILVA. There are some differences: for example, SILVA does not have an OTUs file, and it does not contain any .fasta files at all.

In conclusion, I know I have to use the rep-set or taxonomy files from the SILVA database, but the files inside them are in .fna format, not .fasta.

Second, I am not sure what the reference sequences are. (We know two components are needed for classification: the reference sequences and the corresponding taxonomic classifications.) Are the reference sequences part of the SILVA database? I did not see such an item there.

And can I use the raw data file for training?

I have not been able to work out the answers to these questions.

Thanks again for your support.

Mehrbod_Estaki · April 3, 2019, 10:57pm

Hi @Mehrdad,
The .fasta and .fna files are essentially the same file type with different extensions.
I believe you are working with bacteria, so you will want to look at the files within the 16S folder. The 99% .fna file contains your reference sequences, and its corresponding file within the taxonomy folder is your taxonomy file. Now you have both files you need to train your classifier.
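
If you want to convince yourself that an .fna file is plain FASTA and that it lines up with its taxonomy file, something like the rough sketch below may help. The two tiny files it creates (example_99_16S.fna and example_taxonomy.txt) and the taxonomy strings in them are made up for illustration; they are stand-ins, not the real SILVA file names or contents.

```shell
# Hypothetical miniature stand-ins for a rep-set .fna file and its taxonomy file.
printf '>id1\nACGTACGT\n>id2\nGGCCTTAA\n' > example_99_16S.fna
printf 'id1\tBacteria;Firmicutes\nid2\tBacteria;Proteobacteria\n' > example_taxonomy.txt

# A FASTA file, whatever its extension, has header lines that start with '>'.
n_seqs=$(grep -c '^>' example_99_16S.fna)

# Sequence IDs: the text after '>' up to the first whitespace.
grep '^>' example_99_16S.fna | sed 's/^>//' | awk '{print $1}' | sort > seq_ids.txt
# Taxonomy IDs: the first tab-separated column of the headerless TSV.
cut -f1 example_taxonomy.txt | sort > tax_ids.txt

# Count IDs that appear in the sequences but not in the taxonomy (ideally zero).
missing=$(comm -23 seq_ids.txt tax_ids.txt | wc -l | tr -d ' ')

echo "sequences: $n_seqs, IDs without taxonomy: $missing"
# prints "sequences: 2, IDs without taxonomy: 0"
```

To try this on real data, replace the two printf lines with your own paths; the rest only reads the files.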

I'm not sure what you mean, but you do have to import both files as QIIME 2 artifacts, as per the tutorial you linked, before running them through training.
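
As a rough sketch of what that could look like for SILVA, the import commands from your post can be pointed at the .fna file and its taxonomy file, followed by the naive Bayes training command from the same tutorial. REF_SEQS and REF_TAX are placeholders I made up, as are the output file names; set the two paths to wherever the 99% 16S files live in your download. The tutorial's optional read-extraction step is left out here.

```shell
# Assumed placeholders: point these at the 99% 16S rep-set file and its matching taxonomy file.
REF_SEQS="path/to/99_16S_rep_set.fna"
REF_TAX="path/to/99_16S_taxonomy.txt"

# Only run when QIIME 2 is installed and both input files exist.
if command -v qiime >/dev/null 2>&1 && [ -f "$REF_SEQS" ] && [ -f "$REF_TAX" ]; then
  # The .fna extension is fine here; the content is FASTA.
  qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-path "$REF_SEQS" \
    --output-path silva-ref-seqs.qza

  qiime tools import \
    --type 'FeatureData[Taxonomy]' \
    --input-format HeaderlessTSVTaxonomyFormat \
    --input-path "$REF_TAX" \
    --output-path silva-ref-taxonomy.qza

  # Train the classifier on the two imported artifacts.
  qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads silva-ref-seqs.qza \
    --i-reference-taxonomy silva-ref-taxonomy.qza \
    --o-classifier silva-classifier.qza
  status="trained"
else
  echo "qiime or the input files were not found; nothing was run"
  status="skipped"
fi
```

The guard at the top is only there so the script does nothing harmful if a path is wrong; the three qiime commands are the part that matters.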
