Classifier and taxonomy

Hi all, hopefully somebody can enlighten me :slight_smile:

I am using QIIME 2 version 2020.11, with which I trained my own classifier on SILVA 138 (99% match).

I pulled the files used for training from here: https://docs.qiime2.org/2020.11/data-resources/

  • [Silva 138 SSURef NR99 full-length sequences]
  • [Silva 138 SSURef NR99 full-length taxonomy]

I trained the classifier using the following command:

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier_silva138_99.qza

It completed successfully.

And when I analyse samples, at species level I get taxonomies like:
D_0__Bacteria;D_1__Bacteroidetes;D_2__Bacteroidia;D_3__Bacteroidales;D_4__Bacteroidaceae;D_5__Bacteroides;__

OR

D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__Lachnospiraceae;D_5__Blautia;__

I searched the entire taxonomy txt file used for training the classifier and was not able to find such a classification. Can somebody help me? Why am I getting those undefined species? Level D_6 is not marked in the above taxa at all. Why?

I wanted to blast those sequences, but I cannot even identify them in my classifier files.

Hi @Dzana_B,

Is there a reason why you did not simply use the pre-made trained classifiers available on the same page? Although you can use these files to make your own classifier, these files would not produce outputs with these taxonomy strings:

This tells us that you are using SILVA 132, or earlier, to classify your reads. The taxonomy string formatting of the SILVA classifiers was updated a while ago for SILVA 138, using a more GreenGenes-like formatting, i.e. d__Bacteria; p__Firmicutes; ... . This explains why you are unable to find matches between your classification and the SILVA 138 SSURef files you referred to above.

You'll have to download the corresponding classifiers that are available on the Data Resources page, or train them yourself using the files you already have. Then you'll be able to re-run the taxonomy classification.

Note, it is always best to use the files that are prepared for the version of QIIME 2 that you have. I think we switched over to the newly formatted SILVA reference files in QIIME 2 version 2020.6.


@SoilRotifer thank you for your help :slight_smile:

You are right, I actually did use files from the SILVA 138 edition; I just replaced the level marks in the taxonomy file to match the rest of the program I am developing. So, I quickly replaced D_0__ with d__, and so on.

The reason I am not using the pre-trained classifier is that I have blasted some sequences, and I was thinking of improving the taxonomy at species level. For example, I blast the sequence of this organism:
D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__Lachnospiraceae;D_5__Blautia;__

and find that I could add a species label to it, e.g. Blautia luti, as it has a pretty good match.

Maybe this is not the best practice to follow, but I would definitely like to improve classification of bacteria in samples at this level.

Anyhow, this time I did not change anything: I simply pulled the files from the link and trained the classifier. I tested the classifier on one sample, and it again returned some bacteria I cannot find in the starting taxonomy file, e.g.:

d__Bacteria;p__Firmicutes;c__Clostridia;o__Peptococcales;f__Peptococcaceae;;

This makes me so confused.

Is the above equivalent to:

d__Bacteria; p__Firmicutes; c__Clostridia; o__Peptococcales; f__Peptococcaceae; g__uncultured; s__uncultured_bacterium

as I found this taxon in the taxonomy file?

Does QIIME omit the word “uncultured” from taxa? Or is something else going on?

You do not want to do this. That old numbered rank system is inconsistent. See here for more details:

You can use our RESCRIPt tool to do something quite similar:

I assume the first taxonomy string is the result of taxonomy classification? If so, there is no easy way to know that classification is indeed equivalent to the second taxonomy string you listed. Well, without looking through the reference sequences anyway. :mag_right:
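One way to look through the reference taxonomy is to list every distinct lineage that sits under the family in question. The following is only a minimal sketch: it assumes the taxonomy has been exported to a two-column, tab-separated file (Feature ID, Taxon), and the function name `taxa_under`, the IDs ref1 to ref3, and the `Example...` names are made up for illustration and are not real SILVA entries.

```python
import csv
import io


def taxa_under(rows, prefix):
    """Return the sorted, distinct taxonomy strings that start with the given rank prefix."""
    wanted = [r.strip() for r in prefix.split(";") if r.strip()]
    matches = set()
    for row in rows:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        ranks = [r.strip() for r in row[1].split(";") if r.strip()]
        if ranks[:len(wanted)] == wanted:
            matches.add("; ".join(ranks))
    return sorted(matches)


# Tiny stand-in for an exported taxonomy file; ref2 and ref3 are hypothetical.
SAMPLE_TSV = (
    "Feature ID\tTaxon\n"
    "ref1\td__Bacteria; p__Firmicutes; c__Clostridia; o__Peptococcales; "
    "f__Peptococcaceae; g__uncultured; s__uncultured_bacterium\n"
    "ref2\td__Bacteria; p__Firmicutes; c__Clostridia; o__Peptococcales; "
    "f__Peptococcaceae; g__ExampleGenus; s__ExampleGenus_sp\n"
    "ref3\td__Bacteria; p__ExamplePhylum\n"
)

family = "d__Bacteria;p__Firmicutes;c__Clostridia;o__Peptococcales;f__Peptococcaceae"
rows = csv.reader(io.StringIO(SAMPLE_TSV), delimiter="\t")
for taxon in taxa_under(rows, family):
    print(taxon)
```

With a real file you would pass `csv.reader(open(path), delimiter="\t")` instead of the in-memory sample. If more than one lineage comes back for the family, that is a hint the classifier had several close candidates to choose from.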

The issue is that the classifier likely could not disambiguate between the sequence with the full taxonomy string d__Bacteria; p__Firmicutes; c__Clostridia; o__Peptococcales; f__Peptococcaceae; g__uncultured; s__uncultured_bacterium, versus several other reference sequences with nearly identical taxonomy (likely different genus and species strings). So, the classifier will return the lowest common ancestor (usually). In this case, it could not determine anything past the family level.
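To make the "lowest common ancestor" idea concrete, here is a small sketch that finds the deepest lineage shared by a set of taxonomy strings. It only illustrates the concept and is not the classifier's actual implementation; the second lineage (`g__ExampleGenus`) is invented for the example.

```python
def split_ranks(taxonomy):
    """Split a semicolon-delimited taxonomy string into trimmed, non-empty rank labels."""
    return [rank.strip() for rank in taxonomy.split(";") if rank.strip()]


def lowest_common_ancestor(taxonomies):
    """Return the longest leading run of ranks shared by every taxonomy string."""
    shared = []
    for ranks in zip(*(split_ranks(t) for t in taxonomies)):
        if len(set(ranks)) != 1:
            break  # the lineages diverge at this rank
        shared.append(ranks[0])
    return "; ".join(shared)


candidates = [
    "d__Bacteria; p__Firmicutes; c__Clostridia; o__Peptococcales; "
    "f__Peptococcaceae; g__uncultured; s__uncultured_bacterium",
    # Hypothetical competing reference lineage, for illustration only.
    "d__Bacteria; p__Firmicutes; c__Clostridia; o__Peptococcales; "
    "f__Peptococcaceae; g__ExampleGenus; s__ExampleGenus_sp",
]

print(lowest_common_ancestor(candidates))
# d__Bacteria; p__Firmicutes; c__Clostridia; o__Peptococcales; f__Peptococcaceae
```

The two candidates agree down to the family and differ from the genus onward, so the shared lineage stops at f__Peptococcaceae, which matches the truncated classification above.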

Nope, not unless you filter your sequence data or your reference database in that way. If you see any text after any of the prefixes, e.g. g__uncultured, then that is the taxonomy pulled directly from the reference database itself. RESCRIPt allows you the option of appending the organism name as the species label. However, be careful of trusting this. See the Species-labels: caveat emptor! section of the RESCRIPt tutorial that I linked above for more details.
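If it helps to check quickly where a classification stops, a sketch like the one below reports the deepest rank that still carries a name after its prefix. The helper name `deepest_assigned_rank` is made up for this example; it simply treats a bare prefix such as `__`, or an empty field, as "nothing assigned at this rank".

```python
def deepest_assigned_rank(taxonomy):
    """Return the last rank label that has a name after its prefix, or None if there is none."""
    deepest = None
    for label in taxonomy.split(";"):
        label = label.strip()
        # Everything after the last "__" is the name; it is empty for a bare prefix.
        _, sep, name = label.rpartition("__")
        if sep and name:
            deepest = label
    return deepest


old_style = ("D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;"
             "D_4__Lachnospiraceae;D_5__Blautia;__")
new_style = "d__Bacteria;p__Firmicutes;c__Clostridia;o__Peptococcales;f__Peptococcaceae;;"

print(deepest_assigned_rank(old_style))  # D_5__Blautia
print(deepest_assigned_rank(new_style))  # f__Peptococcaceae
```

In both examples from this thread the trailing fields are empty, so the assignment ends at genus and family respectively; nothing was stripped out of a longer label.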

Hope this helps! :man_technologist:
