Silva classifier seven level code

balford · October 17, 2019, 3:49pm

Hi Qiime folks,
I am having trouble finding the code that the qiime2 developers used to collapse the silva arb down to the classifier-friendly 7 levels, which came up in the recent dada2 discussion. I suspect it's posted somewhere; I'm just not finding it using the usual avenues.

Cheers,
Betsy

Nicholas_Bokulich · October 17, 2019, 3:57pm

Hi @balford,

The QIIME 2 developers are not involved in development or maintenance of SILVA. As far as I know the SILVA developers release the "QIIME formatted" taxonomy and sequences on the SILVA website and I believe the README file contained within that directory explains what formatting has been performed.

I think this may be what you are looking for:

cc: @SoilRotifer

SoilRotifer · October 17, 2019, 5:54pm

Some background as to why the 18S rank information is inconsistent is outlined here:

-Mike

balford · October 17, 2019, 6:09pm

Hi Mike and Nick, I appreciate the prompt replies.

Perhaps I should provide additional clarification. I am attempting to find the documentation for the code that was used to collapse the full silva taxa down to the seven levels found on the Archive page. The readme file states that qiime should be contacted.

If this is incorrect, can you provide a point of contact at silva?

Cheers,
Betsy

SoilRotifer · October 17, 2019, 6:32pm

Not a problem @balford.

The link that @Nicholas_Bokulich provided earlier refers to this information. However, I should have mentioned more explicitly that the nitty-gritty details are contained in the Silva_119_notes.txt file of the Silva v119 release, not in the general README file of the SILVA qiime folder.

Sorry about that.

-Mike

Nicholas_Bokulich · October 17, 2019, 6:37pm

And just to clarify my comments in the context of that passage: that must be a relic of the QIIME 1 days... the QIIME 2 developers do not do any of this reformatting, we leave it up to SILVA to manage their own releases.

SoilRotifer · October 17, 2019, 6:51pm

That information is a bit dated, from the QIIME 1 days. But I think that formatting approach generally holds for the later releases too. I also forgot to mention that if you download the Silva 132 release, there is a Silva_132_notes.txt file that describes how the reference database was constructed for QIIME.

-Mike

balford · October 17, 2019, 8:07pm

Thanks Nick and Mike for your help with this! I found what I am looking for in the parse_to_7_taxa_levels.py script mentioned in the Silva_132_notes.txt. I'm going to try to summarize what I think I understand here so it's documented in the forum for future folks.

When silva has a new release (say the upcoming 138 release, for example), the silva folks place a zip folder in Archive containing, among other things, the full silva database, the 7-level 16S/18S, the 7-level 18S, etc.

The script that silva used to collapse the taxonomy to seven levels is Mike's parse_to_7_taxa_levels.py, located on GitHub. Its header gives the usage as: python parse_to_7_taxa_levels.py X Y, where X is the input taxonomy mapping file and Y is the output taxonomy mapping file. Its stated purpose is to parse the output of Mike Robeson's script to force taxa into 7 levels.

The basic rules for collapsing full silva down to 7 levels:
-if 7 levels, keep all 7 levels
-if fewer than 7 levels, keep all levels and fill the missing levels with "unclassified"
-if more than 7 levels, keep only the top 4 levels and the bottom 3 levels
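
A minimal Python sketch of these three rules, for illustration only. It is not the actual parse_to_7_taxa_levels.py. It assumes semicolon-delimited taxonomy strings, and the function name collapse_to_seven and its filler argument are hypothetical:

```python
def collapse_to_seven(taxonomy, filler="unclassified"):
    """Force a semicolon-delimited taxonomy string to exactly 7 levels.

    Hypothetical helper that mirrors the three rules listed above;
    the real script may differ in its details.
    """
    # Split on ';' and drop empty fields (e.g. from a trailing ';').
    levels = [t.strip() for t in taxonomy.split(";") if t.strip()]

    if len(levels) > 7:
        # More than 7: keep the top 4 and the bottom 3 levels.
        levels = levels[:4] + levels[-3:]
    elif len(levels) < 7:
        # Fewer than 7: keep everything and pad with the filler label.
        levels = levels + [filler] * (7 - len(levels))
    # Exactly 7: leave unchanged.

    return ";".join(levels)


# Placeholder level names, not real SILVA taxa.
print(collapse_to_seven("a;b;c"))
print(collapse_to_seven("a;b;c;d;e;f;g;h;i"))
```

With nine placeholder levels a through i, the middle levels e and f are dropped, which is the part of the rules that can lose informative ranks.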

Correct?

SoilRotifer · October 17, 2019, 8:15pm

You generally have it. But the 7 levels may not work particularly well for the Eukaryotes (unless you make your own modifications), which is why we provide other file versions with the full taxonomy strings.

One point of clarification: usually a member of the QIIME team downloads the raw data from SILVA and carries out the formatting pipeline as outlined in the Silva_132_notes.txt file. After this is completed, we work with SILVA to make these files available to the community.

Also, to give proper credit, parse_to_7_taxa_levels.py was written by @William

In the post I linked / referred to earlier, we think we have a way of alleviating these taxonomy issues, mainly those dealing with the Eukaryotes.

We are currently discussing the best short-term and long-term way to fix this. But we would welcome any input on this.

-Mike

balford · October 17, 2019, 8:57pm

Thanks for the clarification Mike. As a user, it is sometimes difficult to navigate where qiime2 and outside resources begin and end.

For those of us who are interested in the silva/ Eukaryote discussion, how do we get involved?

Cheers,
Betsy

SoilRotifer · October 17, 2019, 9:31pm

Anytime @balford!

Soon, a couple of the QIIME devs will be discussing how to best move forward on this.

I would assume, for now, that posting to this thread should be sufficient. The other devs may qiime-in, if they have other suggestions.

In the interim, do not hesitate to post thoughts or suggestions to this thread. We'll try our best to navigate from here.

SoilRotifer · October 29, 2019, 3:29pm

Hi @balford, and others...

I've uploaded what I consider a "first pass" at fixing the SILVA taxonomy issues, particularly those of the Eukaryotes. The files are on GitHub for the community to test and play around with. The classifiers, and the files used to make them, are available there.

I'll update the documentation as I have time; things are quite busy at the moment.

Let us know if this is useful. We are working on a more streamlined and cleaner approach to provide these. But I figured I'd pass along this short-term quick-fix for those in need.

Cheers and happy :qiime2:-ing!
Mike

SoilRotifer · November 8, 2019, 9:59pm

Hi all, just an FYI:

Here are new locations for the updated SILVA taxonomy (i.e. Greengenes-like) reference files for both the SSU and LSU data. To save space and bandwidth, these are the raw FASTA and TSV files. So, you’ll have to import and train them yourself. Be sure to read the SILVA License.

Be wary of the species labels. For example, a few taxa are annotated with a species label that corresponds not to the organism to which the sequence belongs, but to the source material from which the sequence was obtained. Here is an example:

d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae; g__Serratia; s__Oryza_sativa_Indica_Group_(long-grained_rice)
d__Eukaryota; p__Arthropoda; c__Insecta; o__Hemiptera; f__Hemiptera; g__Hemiptera; s__Oryza_sativa_Indica_Group_(long-grained_rice)

As you can see, we have an insect and a bacterial sequence both annotated with the species label Oryza sativa (rice). In most cases the species rank information seems okay, but there are enough issues like the one above to convince me to be cautious about the species labels in general.
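
One rough way to screen for this kind of problem is to flag any species label that appears under more than one domain. The Python sketch below is only an illustration of that idea, not part of the SILVA or QIIME tooling. It assumes the rank prefixes shown above (d__ and s__) and semicolon-delimited strings, and the function name species_spanning_domains is hypothetical:

```python
from collections import defaultdict


def species_spanning_domains(taxonomy_strings):
    """Return species labels that occur under more than one domain.

    Hypothetical helper: assumes 'd__' marks the domain rank and
    's__' marks the species rank, as in the example strings above.
    """
    domains_by_species = defaultdict(set)

    for tax in taxonomy_strings:
        ranks = [r.strip() for r in tax.split(";") if r.strip()]
        domain = next((r for r in ranks if r.startswith("d__")), None)
        species = next((r for r in ranks if r.startswith("s__")), None)
        # Skip entries that lack either rank.
        if domain and species:
            domains_by_species[species].add(domain)

    # A species label seen in two or more domains is suspicious.
    return {
        species: sorted(domains)
        for species, domains in domains_by_species.items()
        if len(domains) > 1
    }


example = [
    "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
    "o__Enterobacteriales; f__Enterobacteriaceae; g__Serratia; "
    "s__Oryza_sativa_Indica_Group_(long-grained_rice)",
    "d__Eukaryota; p__Arthropoda; c__Insecta; o__Hemiptera; "
    "f__Hemiptera; g__Hemiptera; "
    "s__Oryza_sativa_Indica_Group_(long-grained_rice)",
]
print(species_spanning_domains(example))
```

This only catches labels shared across domains. A mislabeled species that stays within a single domain would need a finer check, for example comparing the species label against the genus.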

Anyway, I hope these are useful.

-Best wishes!
-Mike