SILVA 138 Classifiers

Fixed! Thank you @potatoo!

I added a V3V4 (341F-805R) set here.

EDIT (Feb 2, 2023): Reminder, you can now do this yourself with RESCRIPt. See the link at the top of this thread.

Thank you very much for the updates, Mike Robeson.
I’ve tried using these classifiers for my microbiome data, but some of the species-level taxonomy is confusing, for example:

d__Bacteria;p__Gemmatimonadota;c__Gemmatimonadetes;o__Gemmatimonadales;f__Gemmatimonadaceae;g__uncultured;s__uncultured_actinobacterium

d__Bacteria;p__Myxococcota;c__Myxococcia;o__Myxococcales;f__Anaeromyxobacteraceae;g__Anaeromyxobacter;s__uncultured_proteobacterium

d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__uncultured;f__uncultured;g__uncultured;s__uncultured_actinobacterium

and so on, where the species-level labels (uncultured_actinobacterium, uncultured_proteobacterium) do not belong to the same phylum as the rest of the lineage. Does this come from the primary database source, or from somewhere else?

Thanks.

There is also a problem when I use it with the new qiime2-2020.2:

“The scikit-learn version (0.21.2) used to generate this artifact does not match the current version of scikit-learn installed (0.22.1). Please retrain your classifier for your current deployment to prevent data-corruption errors.”

Hi @didietkeren, if you read my pipeline notes as well as this post, you’ll see why I do not trust species-level taxonomy. As far as I know, SILVA does not curate the taxonomy beyond the genus level. :microbe: As a result, there will be potential conflicts between the species labels and the upper-level taxonomy. This has been noted by other research groups too, see here. This is why I made two versions of the reference database, with and without the species label. :construction_worker_man:
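For anyone who wants to apply the same idea to an existing taxonomy table, here is a minimal Python sketch that drops the species label and keeps the lineage down to genus. It assumes semicolon-separated lineages with the d__ ... s__ prefixes shown in this thread. The function name `drop_species` is made up for illustration, and this is not the code used to build the released files.

```python
def drop_species(lineage: str) -> str:
    """Return the lineage with any s__ (species) entry removed.

    Assumes ranks are separated by ';' and the species entry starts with 's__'.
    """
    ranks = [rank.strip() for rank in lineage.split(";") if rank.strip()]
    kept = [rank for rank in ranks if not rank.startswith("s__")]
    return ";".join(kept)


# One of the conflicting lineages quoted earlier in the thread.
example = (
    "d__Bacteria;p__Myxococcota;c__Myxococcia;o__Myxococcales;"
    "f__Anaeromyxobacteraceae;g__Anaeromyxobacter;"
    "s__uncultured_proteobacterium"
)
print(drop_species(example))
```

The result ends at g__Anaeromyxobacter, so the conflicting species label never reaches downstream summaries.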

I’ve not had time to re-train the classifiers for the latest version of QIIME 2 (2020.2). However, you can follow the procedure outlined here to train the classifiers yourself, simply by making use of the sequence and taxonomy qza files I’ve made available. :clamp:
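If you are unsure whether a pre-trained classifier matches your environment, a rough Python check like the one below can help before you run into the warning quoted above. It assumes that an exact version match is what matters, which is what the warning text suggests. The helper names are hypothetical, and the "trained with" version has to come from wherever the classifier was published.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def needs_retraining(trained_with: str, installed: Optional[str]) -> bool:
    """True when the installed version differs from the training version."""
    return installed != trained_with


# Versions taken from the warning quoted in this thread.
print(needs_retraining("0.21.2", "0.22.1"))

# Check against whatever is installed in the current environment.
print(needs_retraining("0.21.2", installed_version("scikit-learn")))
```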

Thank you SoilRotifer for your precious work!
I'm a novice here, so if this is a silly question, please accept my apologies.
I'm trying to train a SILVA 138 classifier with 520-926r primers.
The problem I have is at step 6 of your pipeline, which gives me an error:

" filter_fasta_by_seq_id.py: error: unrecognized arguments: -f SILVA_align_seqs.fasta "

Should I change -f to -i for the input file? I saw that -f is not an argument defined in your Python script.

Thanks!

Hi @Stefano_S,

Thanks for finding that typo! Yes, it should be -i. For any of the scripts, if you type -h, as in:

filter_fasta_by_seq_id.py -h

You'll see some help text with the appropriate options. As y'all are finding, this has been very much a work in progress. :hammer_and_wrench:
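To illustrate why -f was rejected while -i and -h work, here is a small stand-in written with Python's argparse. It is not the actual filter_fasta_by_seq_id.py: the parser below only defines -i (the long name --input-fasta and the program name are assumptions for the example), so anything else ends up in the "unrecognized" pile.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser that, like the real script, accepts -i but not -f."""
    parser = argparse.ArgumentParser(
        prog="filter_fasta_demo.py",  # hypothetical name for this sketch
        description="Toy example of an input-file option.",
    )
    parser.add_argument(
        "-i", "--input-fasta", required=True, help="path to the input FASTA file"
    )
    return parser


parser = build_parser()

# The corrected call: -i is a defined option.
args = parser.parse_args(["-i", "SILVA_align_seqs.fasta"])
print(args.input_fasta)

# parse_known_args collects what the parser does not recognize instead of exiting.
_, unrecognized = parser.parse_known_args(
    ["-i", "other.fasta", "-f", "SILVA_align_seqs.fasta"]
)
print(unrecognized)

# This is the text that -h prints for the toy parser.
print(parser.format_help())
```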

-Mike

@SoilRotifer thanks for the classifier.
Using these classifiers (V3V4), I noticed that some taxa have the same name at the lower levels:
d__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridia_UCG-014;f__Clostridia_UCG-014;g__Clostridia_UCG-014

d__Bacteria;p__Firmicutes;c__Bacilli;o__RF39;f__RF39;g__RF39

d__Bacteria;p__Firmicutes;c__Clostridia;o__Oscillospirales;f__UCG-010;g__UCG-010

Is it right to call them unclassified order__Clostridia_UCG-014, order__RF39, family__UCG-010 respectively?

What is the difference between these and taxa that are assigned ;__ at the lower levels? E.g. d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;__
or d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;;

Thanks

Hi @prince,

To clarify, we do not curate the taxonomy; we are simply parsing the files as provided by SILVA. As for the upper-level taxonomy being propagated downward... this is intended by me. The reason is that not all reference sequences have taxonomic annotations for each rank. When a particular rank is missing, in this case the family level, the name from the rank above is propagated downward into the empty position. This continues until a lower rank with its own annotation is found; that name is then propagated downward in the same way, and so on, down to the genus or species level. For example:

d__Bacteria; p__Firmicutes; c__Clostridia; o__Peptostreptococcales-Tissierellales; f__Peptostreptococcales-Tissierellales; g__Peptoniphilus; s__Peptoniphilaceae_bacterium

Note that the o__ rank is propagated down to f__, but we found rank information for g__ and s__.

This propagation of rank information is a convention followed by other research groups and tool developers too. It makes it easier to meet the requirements of various taxonomy classifiers (e.g. some tools require that all reference sequences have the same number of ranks).
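The propagation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the parsing code from the pipeline: it assumes seven ranks (d, p, c, o, f, g, s), takes the names without prefixes, and uses None for a missing rank. The function name `propagate_ranks` is made up for the example.

```python
RANK_PREFIXES = ["d__", "p__", "c__", "o__", "f__", "g__", "s__"]


def propagate_ranks(names):
    """Fill missing ranks with the nearest name above, using each slot's own prefix.

    `names` holds one entry per rank (domain to species); None or '' means missing.
    """
    if len(names) != len(RANK_PREFIXES):
        raise ValueError("expected one name per rank, domain through species")
    filled = []
    last_seen = None
    for prefix, name in zip(RANK_PREFIXES, names):
        if name:
            last_seen = name  # a real annotation replaces the propagated one
        if last_seen is None:
            raise ValueError("the top rank must be annotated")
        filled.append(prefix + last_seen)
    return "; ".join(filled)


# The family-level annotation is missing, as in the example above.
names = [
    "Bacteria",
    "Firmicutes",
    "Clostridia",
    "Peptostreptococcales-Tissierellales",
    None,
    "Peptoniphilus",
    "Peptoniphilaceae_bacterium",
]
print(propagate_ranks(names))
```

The printed lineage has f__Peptostreptococcales-Tissierellales in the family slot, while g__ and s__ keep their own annotations.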

While repeating the label with its original prefix (for example, a second o__ entry in the family position) might be technically correct, it would make parsing the taxonomy onerous, and downstream analyses may fail because there would be two o__ levels. This is why we prepend each rank with its own prefix (o__, f__, etc.) to provide some level of unique rank-level information.

Basically, consider these annotations as saying: "We are using the name Clostridia_UCG-014 to fill the f__ slot, and again for the g__ slot." This is how these taxonomy annotations should be interpreted. For example, there are many cases of s__gut_metagenome, which is not a legitimate species name in its own right; the annotation gut_metagenome is simply being used to fill in the s__ slot.
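If you want to report such taxa by the rank the name really belongs to, a heuristic like the following may help. It treats a name that is identical to the one directly above it as propagated and returns the deepest rank with its own name. This is an assumption-laden sketch (a genuine taxon that shares its parent's name would be misread), and the function names are hypothetical.

```python
def split_lineage(lineage):
    """Split 'd__X;p__Y;...' into (prefix, name) pairs, skipping empty slots."""
    pairs = []
    for entry in lineage.split(";"):
        entry = entry.strip()
        if "__" not in entry:
            continue
        prefix, name = entry.split("__", 1)
        if name:
            pairs.append((prefix, name))
    return pairs


def lowest_original_rank(lineage):
    """Return the deepest (prefix, name) whose name is not copied from above."""
    pairs = split_lineage(lineage)
    if not pairs:
        return None
    deepest = pairs[0]
    for above, current in zip(pairs, pairs[1:]):
        if current[1] != above[1]:
            deepest = current
    return deepest


print(lowest_original_rank(
    "d__Bacteria;p__Firmicutes;c__Clostridia;"
    "o__Clostridia_UCG-014;f__Clostridia_UCG-014;g__Clostridia_UCG-014"
))
print(lowest_original_rank(
    "d__Bacteria;p__Firmicutes;c__Clostridia;o__Oscillospirales;f__UCG-010;g__UCG-010"
))
```

For the two lineages above, the name is attributed to the order (o) and to the family (f) respectively, which matches reading them as an unclassified member of order Clostridia_UCG-014 and of family UCG-010.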

This is not perfect, but it is what we have to work with... Curating taxonomy is hard work, which is why we greatly appreciate those that do it! :love_you_gesture:

As for the ;__, there are a few answers to this on the forum, but here is one explanation:

-Hope this helps!
-Best wishes. :slight_smile:

Thanks @SoilRotifer.
Got it :ok_hand: :+1:

Hi @SoilRotifer,
Once again, thanks for the clarification.

In reference to these:

As for the upper-level taxonomy being propagated downward… this is intended by me........When a particular rank is missing, in this case the family level, the taxonomic rank above is propagated downward to a lower rank position until it finds a rank at a lower level, then it propagates that rank downwards towards, and so on… to the genus or species level.

Is

d__Bacteria;p__Firmicutes;c__Clostridia; o__Clostridia_UCG-014;f__Clostridia_UCG-014;g__Clostridia_UCG-014

equivalent to the following?

d__Bacteria;p__Firmicutes;c__Clostridia; o__Clostridia_UCG-014;f__;g__

This particular sequence is classified as d__Bacteria;p__Firmicutes;c__Clostridia; o__Clostridia_UCG-014;f__Clostridia_UCG-014;g__Clostridia_UCG-014;__

I guess the reference sequences do not have annotations beyond o__ rank.

Can you help me clarify this?

Right, I just realized that I did not provide enough details for one particular case… I forgot that you had mentioned the use of the V3V4 (amplicon-specific) classifier. When these are made, any sequences that become identical after extraction of the amplicon region, but carry different taxonomic annotations, will effectively have their taxonomy truncated to the lowest common ancestor (LCA). See the pipeline in the original post of this thread.

So, in this case, the lower rank information would not be available. But the base reference files do contain all the taxonomic ranks as I outlined earlier. This changes when we extract the amplicon region, as we are more likely to have identical sequences over the various shorter regions of the gene.
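As a rough illustration of that LCA step, here is a toy Python version. It is not the code from the pipeline or from RESCRIPt: it simply groups records by identical sequence and keeps the ranks that all lineages in a group share from the top down. The sequences are placeholders and the function names are made up.

```python
from collections import defaultdict


def lowest_common_ancestor(lineages):
    """Keep the leading ranks that are identical across all lineages."""
    split = [[rank.strip() for rank in lineage.split(";")] for lineage in lineages]
    common = []
    for ranks_at_level in zip(*split):
        if len(set(ranks_at_level)) != 1:
            break  # first disagreement: everything below is dropped
        common.append(ranks_at_level[0])
    return ";".join(common)


def dereplicate_to_lca(records):
    """Map each unique sequence to the LCA of the lineages attached to it."""
    lineages_by_seq = defaultdict(list)
    for sequence, lineage in records:
        lineages_by_seq[sequence].append(lineage)
    return {seq: lowest_common_ancestor(lins) for seq, lins in lineages_by_seq.items()}


# Toy amplicons: the first two became identical after trimming to the region.
records = [
    ("ACGTACGT", "d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Blautia"),
    ("ACGTACGT", "d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Roseburia"),
    ("TTGACCGA", "d__Bacteria;p__Firmicutes;c__Bacilli;o__RF39;f__RF39;g__RF39"),
]
for sequence, lineage in dereplicate_to_lca(records).items():
    print(sequence, lineage)
```

The shared amplicon keeps its lineage only down to f__Lachnospiraceae, which is why a classifier trained on it can return a family-level hit followed by ;__ for the genus.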

Does this help?

-Best
-Mike

Yes.
Thanks @SoilRotifer

Thanks so much for this. Enormously helpful and much appreciated. Any difference between ver_0.01 and ver_0.02?

Also, mostly out of curiosity and interest to learn, is training the classifier directly using the primer sequences on QIIME2 not advisable? I saw your comment about retaining more sequences and was hoping you could expand on it a little.