SILVA 138 Classifiers

Hi @M_R ,

The NR99 as described in the pipeline link of the original post. :slight_smile:

-Mike

Hi @SoilRotifer
Thank you for your awesome work. It helps a lot.
Just one note: there’s a typo (dunno if it’s on purpose) in the ‘coVNert_rna_to_dna.py’
script name; I assume it should be ‘coNVert_rna_to_dna.py’.

Hahaha! Thank you @T.J.Sanko for that nice, yet embarrassing, catch! :man_facepalming:

I’m glad this is useful! I will fix post haste! :slight_smile:

-Mike

Thank you for your great job. I encountered a problem when I ran the pipeline in qiime2-2019.10.

parse_silva_taxonomy.py: /ur/bin/env: bad interpreter: No such file or directory

I assume “#!/ur/bin/env” should be: “#!/usr/bin/env”.

potatoo

Fixed! Thank you @potatoo!

I added a V3V4 (341F-805R) set here.

Thank you very much for the updates, Mike Robeson.
I’ve tried using these classifiers for my microbiome data, but there are some confusing taxonomic assignments at the species level, such as:

d__Bacteria;p__Gemmatimonadota;c__Gemmatimonadetes;o__Gemmatimonadales;f__Gemmatimonadaceae;g__uncultured;s__uncultured_actinobacterium

d__Bacteria;p__Myxococcota;c__Myxococcia;o__Myxococcales;f__Anaeromyxobacteraceae;g__Anaeromyxobacter;s__uncultured_proteobacterium

d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__uncultured;f__uncultured;g__uncultured;s__uncultured_actinobacterium

and so on, where the species-level labels (uncultured_actinobacterium, uncultured_proteobacterium) do not belong to the same phylum as the upper ranks. Does this come from the primary database source, or from somewhere else?

Thanks.

There’s also a problem when I use it with the new qiime2-2020.2:

“The scikit-learn version (0.21.2) used to generate this artifact does not match the current version of scikit-learn installed (0.22.1). Please retrain your classifier for your current deployment to prevent data-corruption errors.”

Hi @didietkeren, if you read my pipeline notes as well as this post, you’ll see why I do not trust species-level taxonomy. As far as I know, SILVA does not curate the taxonomy beyond the genus level. :microbe: As a result, there will be potential conflicts between the species labels and the upper-level taxonomy. This has been noted by other research groups too, see here. This is why I made two versions of the reference database, with and without the species label. :construction_worker_man:
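
If you only have a taxonomy string that includes the species field and want to ignore it, a minimal sketch like the one below may help. It is only an illustration of trimming the s__ field from a string in the d__ … s__ format shown in this thread; the function name drop_species is made up here, and this is not how the species-free reference files were actually produced.

```python
def drop_species(taxonomy: str) -> str:
    """Return the taxonomy string without its s__ (species) field.

    Hypothetical helper for illustration only; assumes fields are
    separated by ';' and that the species field starts with 's__'.
    """
    # Split into rank fields and tolerate optional spaces after ';'.
    ranks = [field.strip() for field in taxonomy.split(";")]
    # Keep every non-empty field that is not the species field.
    kept = [field for field in ranks if field and not field.startswith("s__")]
    return ";".join(kept)


example = (
    "d__Bacteria;p__Myxococcota;c__Myxococcia;o__Myxococcales;"
    "f__Anaeromyxobacteraceae;g__Anaeromyxobacter;"
    "s__uncultured_proteobacterium"
)
print(drop_species(example))
```

The result stops at g__Anaeromyxobacter, so the conflicting uncultured_proteobacterium label no longer appears.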

I’ve not had time to re-train the classifiers for the latest version of QIIME 2 (2020.2). However you can follow the procedure outlined here to train the classifiers yourself. You can do this by simply making use of the sequence and taxonomy qza files I’ve made available. :clamp:
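
As a rough sketch of what the retraining step could look like, the snippet below assembles the QIIME 2 command for fitting a naive Bayes classifier. The three .qza file names are placeholders (assumptions, not the actual names of the files I’ve posted), and the helper fit_classifier_command is hypothetical. Run the resulting command inside the QIIME 2 environment you intend to classify with, so the scikit-learn versions match.

```python
def fit_classifier_command(seqs_qza, taxonomy_qza, classifier_qza):
    """Build the argument list for training a naive Bayes classifier.

    The file names passed in are placeholders chosen by the caller.
    """
    return [
        "qiime", "feature-classifier", "fit-classifier-naive-bayes",
        "--i-reference-reads", seqs_qza,
        "--i-reference-taxonomy", taxonomy_qza,
        "--o-classifier", classifier_qza,
    ]


cmd = fit_classifier_command(
    "silva-138-seqs.qza",        # placeholder: reference sequences
    "silva-138-tax.qza",         # placeholder: reference taxonomy
    "silva-138-classifier.qza",  # placeholder: output classifier
)
print(" ".join(cmd))

# To actually run it from an activated QIIME 2 environment:
# import subprocess
# subprocess.run(cmd, check=True)
```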

Thank you SoilRotifer for your precious work!
I’m a novice here, so if this is a silly question, please accept my apologies.
I’m trying to train a SILVA 138 classifier with 520-926r primers.
The problem I have is at point 6 of your pipeline that gives me an error:

" filter_fasta_by_seq_id.py: error: unrecognized arguments: -f SILVA_align_seqs.fasta "

Should I change -f to -i for the input file? I saw that -f is not an argument defined in your Python script.

Thanks!

Hi @Stefano_S,

Thanks for finding that typo! Yes, it should be -i. For any of the scripts, if you type -h, as in:

filter_fasta_by_seq_id.py -h

You’ll see some help text with the appropriate options. As y’all are finding, this has been very much a work in progress. :hammer_and_wrench:

-Mike

@SoilRotifer thanks for the classifier.
Using these classifiers (V3V4), I noticed that some taxa have the same name at lower levels:
d__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridia_UCG-014;f__Clostridia_UCG-014;g__Clostridia_UCG-014

d__Bacteria;p__Firmicutes;c__Bacilli;o__RF39;f__RF39;g__RF39

d__Bacteria;p__Firmicutes;c__Clostridia;o__Oscillospirales;f__UCG-010;g__UCG-010

Is it right to call them unclassified order__Clostridia_UCG-014, order__RF39, family__UCG-010 respectively?

What is the difference between these and taxa that are assigned ;__ at lower taxonomic levels? E.g. d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;__
or d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;;

Thanks

Hi @prince,

To clarify, we do not curate the taxonomy; we are simply parsing the files as provided by SILVA. As for the upper-level taxonomy being propagated downward… this is intended by me. The reason is that not all reference sequences have taxonomic annotations for each rank. When a particular rank is missing (in this case the family level), the name from the rank above is propagated downward into each empty lower rank until a rank with its own annotation is found; that annotation is then propagated downward in the same way, and so on… down to the genus or species level. For example:

d__Bacteria; p__Firmicutes; c__Clostridia; o__Peptostreptococcales-Tissierellales; f__Peptostreptococcales-Tissierellales; g__Peptoniphilus; s__Peptoniphilaceae_bacterium

Note that the o__ rank is propagated down to f__, but we found rank information for g__ and s__.
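
To make the fill-down rule concrete, here is a minimal sketch of the idea. It is not the code from the pipeline scripts; the rank list, the function name propagate_ranks, and the dictionary input are assumptions chosen for illustration.

```python
# Rank prefixes in the order used in this thread (domain to species).
RANKS = ["d", "p", "c", "o", "f", "g", "s"]


def propagate_ranks(known: dict) -> str:
    """Fill each missing rank with the nearest annotated name above it.

    `known` maps a rank prefix (e.g. "o") to its name; ranks without an
    annotation are simply absent or mapped to an empty value.
    """
    fields = []
    last_name = None
    for rank in RANKS:
        # Use this rank's own name if present, else reuse the one above.
        name = known.get(rank) or last_name
        if name is None:
            raise ValueError("no annotation at or above rank " + rank)
        fields.append(rank + "__" + name)
        last_name = name
    return "; ".join(fields)


annotated = {
    "d": "Bacteria",
    "p": "Firmicutes",
    "c": "Clostridia",
    "o": "Peptostreptococcales-Tissierellales",
    # no family-level annotation for this sequence
    "g": "Peptoniphilus",
    "s": "Peptoniphilaceae_bacterium",
}
print(propagate_ranks(annotated))
```

With the family missing, the f__ slot picks up the order name, while g__ and s__ keep their own annotations, matching the example above.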

This propagation of rank information is a convention followed by other research groups and tool developers too. It makes it easier to meet the requirements of various taxonomy classifiers (e.g. some tools require that all reference sequences have the same number of ranks).

While calling them that (e.g. unclassified order__Clostridia_UCG-014) might be technically correct, it would make parsing taxonomy onerous. Downstream analyses may fail, as there would be two o__ levels. This is why we prepend each rank with o__, f__, etc… to provide some level of unique rank-level information.

Basically consider these annotations as saying: “We are using the name Clostridia_UCG-014 to fill the f__ slot, and again for the g__ slot.” This is how these taxonomy annotations should be interpreted, e.g. there are many cases of s__gut_metagenome, which is not a legitimate taxonomic rank in its own right. So, the annotation gut_metagenome is being used to fill-in the s__ slot.
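
If you want to see which slots in a given annotation were filled this way, a rough heuristic is to look for a name that repeats the rank directly above it. The sketch below assumes exactly that; the function name filled_slots is hypothetical, and a repeated name is only a hint, not proof, that the slot was filled by propagation.

```python
def filled_slots(taxonomy: str) -> list:
    """List rank prefixes whose name repeats the rank directly above.

    Heuristic only: assumes ';'-separated fields of the form 'x__Name'.
    """
    filled = []
    previous = None
    for field in taxonomy.split(";"):
        prefix, sep, name = field.strip().partition("__")
        if not sep or not name:
            continue  # skip empty or unannotated fields such as "__"
        if name == previous:
            filled.append(prefix)
        previous = name
    return filled


print(filled_slots(
    "d__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridia_UCG-014;"
    "f__Clostridia_UCG-014;g__Clostridia_UCG-014"
))
```

For this string the f and g slots are reported, i.e. the deepest annotation that came from SILVA itself is the order.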

This is not perfect, but what we have to work with… Curating taxonomy is hard work, which is why we greatly appreciate those that do it! :love_you_gesture:

As for the ;__, there are a few answers to this on the forum, but here is one explanation:

-Hope this helps!
-Best wishes. :slight_smile:

Thanks @SoilRotifer.
Got it :ok_hand: :+1:

Hi @SoilRotifer,
Once again, thanks for the clarification.

In reference to these:

As for the upper-level taxonomy being propagated downward… this is intended by me…When a particular rank is missing, in this case the family level, the taxonomic rank above is propagated downward to a lower rank position until it finds a rank at a lower level, then it propagates that rank downwards towards, and so on… to the genus or species level.

Is
d__Bacteria;p__Firmicutes;c__Clostridia; o__Clostridia_UCG-014;f__Clostridia_UCG-014;g__Clostridia_UCG-014

equivalent to the following?

d__Bacteria;p__Firmicutes;c__Clostridia; o__Clostridia_UCG-014;f__;g__

This particular sequence is classified as d__Bacteria;p__Firmicutes;c__Clostridia; o__Clostridia_UCG-014;f__Clostridia_UCG-014;g__Clostridia_UCG-014;__

I guess the reference sequences do not have annotations beyond the o__ rank.

Can you help me clarify this?