Hi QIIME folks,
I am having trouble finding the code that the QIIME 2 developers used to collapse the SILVA ARB taxonomy down to the classifier-friendly 7 levels, which came up in the recent DADA2 discussion. I suspect it’s posted somewhere; I’m just not finding it through the usual avenues.
The QIIME 2 developers are not involved in the development or maintenance of SILVA. As far as I know, the SILVA developers release the "QIIME formatted" taxonomy and sequences on the SILVA website, and I believe the README file in that directory explains what formatting has been performed.
Perhaps I should provide additional clarification. I am trying to find the documentation for the code that was used to collapse the full SILVA taxonomy down to the seven levels found on this page: Archive. The README file states that QIIME should be contacted.
The link that @Nicholas_Bokulich provided earlier refers to this information. However, I should have mentioned more explicitly that the nitty-gritty details are in the Silva_119_notes.txt file of the SILVA v119 release, not in the general README file of the SILVA qiime folder.
And just to clarify my comments in the context of that passage: that must be a relic of the QIIME 1 days… the QIIME 2 developers do not do any of this reformatting, we leave it up to SILVA to manage their own releases.
That information is a bit dated, from the QIIME 1 days, but I think that formatting approach generally holds for the later releases too. I also forgot to mention that if you download the SILVA 132 release, there is a Silva_132_notes.txt file that describes how the reference database was constructed for QIIME.
Thanks Nick and Mike for your help with this! I found what I am looking for in the parse_to_7_taxa_levels.py script mentioned in the Silva_132_notes.txt. I’m going to try to summarize what I think I understand here so it’s documented in the forum for future folks.
When SILVA has a new release (say the upcoming 138 release, for example), the SILVA folks place a zip archive in https://www.arb-silva.de/download/archive/qiime containing, among other things, the full SILVA taxonomy, the 7-level 16S/18S taxonomy, the 18S-only 7-level taxonomy, etc.
The basic rules for collapsing the full SILVA taxonomy down to 7 levels:
- If a taxonomy string has exactly 7 levels, keep all 7.
- If it has fewer than 7 levels, keep all of them and fill the missing levels with “unclassified”.
- If it has more than 7 levels, keep only the top 4 levels and the bottom 3 levels.
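For future readers, here is a minimal Python sketch of the three rules as I summarized them above. It is not the actual parse_to_7_taxa_levels.py script, and the function name `collapse_to_7_levels`, the `;` separator, and the `filler` parameter are my own assumptions for illustration; check the Silva_132_notes.txt file and the real script for the exact behavior.

```python
def collapse_to_7_levels(taxonomy, sep=";", filler="unclassified"):
    """Collapse a taxonomy string to exactly 7 levels.

    Hypothetical helper illustrating the rules summarized in this thread;
    it is not the code shipped with the SILVA QIIME release.
    """
    # Split on the separator and drop empty fields (e.g. from a trailing ';').
    levels = [lvl.strip() for lvl in taxonomy.split(sep) if lvl.strip()]

    if len(levels) > 7:
        # More than 7 levels: keep the top 4 and the bottom 3.
        levels = levels[:4] + levels[-3:]
    else:
        # 7 or fewer levels: keep everything and pad up to 7 with the filler.
        levels = levels + [filler] * (7 - len(levels))
    return levels


if __name__ == "__main__":
    for tax in ("A;B;C", "A;B;C;D;E;F;G", "A;B;C;D;E;F;G;H;I"):
        print(tax, "->", ";".join(collapse_to_7_levels(tax)))
```

Note that for deep lineages the middle ranks are simply dropped, so what ends up in each of the 7 slots does not necessarily correspond to a true domain-to-species rank.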
You generally have it. But the 7-level format may not work particularly well for the Eukaryotes (unless you make your own modifications), which is why we provide other file versions with the full taxonomy strings.
One point of clarification: usually a member of the QIIME team downloads the raw data from SILVA and carries out the formatting pipeline as outlined in the Silva_132_notes.txt file. After this is completed, we work with SILVA to make these files available to the community.
Also, to give proper credit, the parse_to_7_taxa_levels.py script was written by @William
In the post I linked / referred to earlier, we think we have a way of alleviating these taxonomy issues, mainly those dealing with the Eukaryotes.
We are currently discussing the best short-term and long-term way to fix this. But we would welcome any input on this.
I've uploaded what I consider a "first pass" at fixing the SILVA taxonomy issues, particularly those of the Eukaryotes. The files are on GitHub for the community to test and play around with. The classifiers, and the files used to make them, are available there.
I'll update the documentation as I have time; things are quite busy at the moment.
Let us know if this is useful. We are working on a more streamlined and cleaner approach to provide these. But I figured I'd pass along this short-term quick-fix for those in need.
Here are new locations for the updated SILVA taxonomy (i.e. Greengenes-like) reference files for both the SSU and LSU data. To save space and bandwidth, these are the raw FASTA and TSV files, so you’ll have to import them and train the classifiers yourself. Be sure to read the SILVA License.
Be wary of the species labels. For example, there are a few taxa annotated with a species label that corresponds not to the organism to which the sequence belongs, but to the source material from which the sequence was obtained. Here is an example:
d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae; g__Serratia; s__Oryza_sativa_Indica_Group_(long-grained_rice)
d__Eukaryota; p__Arthropoda; c__Insecta; o__Hemiptera; f__Hemiptera; g__Hemiptera; s__Oryza_sativa_Indica_Group_(long-grained_rice)
As you can see, we have an insect and a bacterial sequence both annotated with the species label Oryza sativa (rice). In most cases the species rank information seems okay, but there are enough issues like the one above to convince me to be cautious of the species label in general.
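One rough way to screen for labels like this is to look for species names that show up under more than one domain. The Python sketch below assumes taxonomy strings in the `d__...; p__...; ...; s__...` format shown above; the helper names `parse_ranks` and `cross_domain_species` are hypothetical, and this heuristic will only catch the most obvious cases (it will not flag a mislabeled species that stays within a single domain).

```python
from collections import defaultdict


def parse_ranks(taxonomy):
    """Turn 'd__Bacteria; p__Proteobacteria; ...' into {'d': 'Bacteria', ...}."""
    ranks = {}
    for field in taxonomy.split(";"):
        field = field.strip()
        if "__" in field:
            prefix, name = field.split("__", 1)
            ranks[prefix] = name
    return ranks


def cross_domain_species(taxonomies):
    """Return {species_label: set_of_domains} for labels seen in more than one domain."""
    domains_by_species = defaultdict(set)
    for taxonomy in taxonomies:
        ranks = parse_ranks(taxonomy)
        # Skip entries that lack a domain or a species label.
        if ranks.get("d") and ranks.get("s"):
            domains_by_species[ranks["s"]].add(ranks["d"])
    return {sp: doms for sp, doms in domains_by_species.items() if len(doms) > 1}


if __name__ == "__main__":
    examples = [
        "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; "
        "f__Enterobacteriaceae; g__Serratia; s__Oryza_sativa_Indica_Group_(long-grained_rice)",
        "d__Eukaryota; p__Arthropoda; c__Insecta; o__Hemiptera; f__Hemiptera; "
        "g__Hemiptera; s__Oryza_sativa_Indica_Group_(long-grained_rice)",
    ]
    for species, domains in cross_domain_species(examples).items():
        print(species, sorted(domains))
```

Anything this flags still needs a manual look before you decide to drop or relabel it.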