Preparing SILVA132 for QIIME1/2 Use

peterleary · December 15, 2017, 5:10pm

Hi everyone,

SILVA132 just got released, and I'm very excited to use it with some 16S Illumina data I got last week!

It seems some kind souls around here prepare the SILVA files for use with QIIME as found here https://www.arb-silva.de/download/archive/qiime/

I'm following these instructions to prepare the 132 files, but I'm getting a bit stuck, embarrassingly even at the earlier stages.

When trying to run pick_rep_set.py on the clustered sequences, I keep getting the same KeyError. The error refers to KC716090, which happens to be denovo0 in my otu_map.txt. The OTU ID is in both my denovo_abundance_sorted.fna and otu map, so I'm not sure why it's unhappy.

If anyone else has been playing with SILVA132, I'd love to hear how you're getting on.

I know this is a QIIME1 script in a QIIME2 forum; I do hope this isn't considered sacrilege!

I'd also like to say thanks to all the QIIME devs for their impeccable craftsmanship!

Peter

ebolyen · December 15, 2017, 5:13pm

Hey @peterleary!

It kind of is, but since the goal is to get SILVA132 into a compatible format, I think it makes sense to discuss it here!

On that note, I'm not at all familiar with the process, so hopefully someone else can jump in here. Maybe there are easier ways to do this now, or perhaps QIIME 2 could be adapted to support SILVA out of the box better.

peterleary · December 15, 2017, 5:43pm

So I figured out that when doing the pick_rep_set.py script, I'm supposed to use the otu_map.txt from pick_otus.py and the original SILVA132 fasta file, rather than the denovo_abundance_sorted.fna that's also produced from pick_otus.py. Doh. I think I've made some progress though.
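For anyone hitting a similar KeyError, one quick check is whether every sequence ID named in the OTU map actually appears in the fasta file passed alongside it. The sketch below assumes the usual tab-delimited OTU map (OTU ID first, then member sequence IDs) and fasta labels whose first whitespace-delimited token is the sequence ID. The function names and the toy IDs are made up for illustration; they are not part of QIIME.

```python
def fasta_ids(lines):
    """Return the set of sequence IDs (first token of each '>' label)."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def missing_from_fasta(otu_map_lines, fasta_lines):
    """Map each OTU ID to the member sequence IDs that the fasta lacks."""
    known = fasta_ids(fasta_lines)
    missing = {}
    for line in otu_map_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        otu_id, members = fields[0], fields[1:]
        absent = [m for m in members if m not in known]
        if absent:
            missing[otu_id] = absent
    return missing


# Hypothetical toy data: one fasta uses the same IDs as the OTU map,
# the other uses different labels.
otu_map = ["denovo0\tseqA\tseqB\n"]
matching_fasta = [">seqA some description\n", "ACGT\n", ">seqB\n", "ACGT\n"]
mismatched_fasta = [">denovo0\n", "ACGT\n"]

print(missing_from_fasta(otu_map, matching_fasta))
print(missing_from_fasta(otu_map, mismatched_fasta))
```

An empty result means the two files agree on IDs; anything else lists the IDs that would not be found in the fasta.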

I followed the instructions from the SILVA128 notes to get a 16S 99% rep set fasta file (as per the "creation of representative sequence files" and then "Splitting fasta files by domain" sections) – then imported this into QIIME2.

Next, I followed the instructions to create a taxonomy map (as per the "Taxonomy mapping file creation" and then "Parsing and splitting taxonomy mapping files" sections). It would not parse to 7 levels (from 14), and I didn't create a majority or consensus taxonomy map yet (namely because I was too eager to see if it would even work). My parsed 16S taxonomy map imported into QIIME2.

I then extracted the reads and trained the classifier, then classified the features. And, it seems to work?
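The import, extract, train and classify steps can be sketched as a list of QIIME 2 command lines. This is only an outline under assumptions: the file names and primer placeholders are hypothetical, and flag spellings have changed between QIIME 2 releases (the import format flag in particular), so each command should be checked against its `--help` output before use. The helper only builds and prints the commands; it does not run them.

```python
import shlex


def classifier_commands(rep_seqs_fasta, taxonomy_tsv, f_primer, r_primer,
                        query_reads_qza):
    """Build command lines for import, read extraction, training, classifying."""
    return [
        ["qiime", "tools", "import",
         "--type", "FeatureData[Sequence]",
         "--input-path", rep_seqs_fasta,
         "--output-path", "ref-seqs.qza"],
        # Assumption: a headerless two-column taxonomy map; older releases
        # spelled this flag differently.
        ["qiime", "tools", "import",
         "--type", "FeatureData[Taxonomy]",
         "--input-format", "HeaderlessTSVTaxonomyFormat",
         "--input-path", taxonomy_tsv,
         "--output-path", "ref-taxonomy.qza"],
        ["qiime", "feature-classifier", "extract-reads",
         "--i-sequences", "ref-seqs.qza",
         "--p-f-primer", f_primer,
         "--p-r-primer", r_primer,
         "--o-reads", "ref-seqs-extracted.qza"],
        ["qiime", "feature-classifier", "fit-classifier-naive-bayes",
         "--i-reference-reads", "ref-seqs-extracted.qza",
         "--i-reference-taxonomy", "ref-taxonomy.qza",
         "--o-classifier", "classifier.qza"],
        ["qiime", "feature-classifier", "classify-sklearn",
         "--i-classifier", "classifier.qza",
         "--i-reads", query_reads_qza,
         "--o-classification", "taxonomy.qza"],
    ]


# Hypothetical inputs; substitute real paths and primer sequences.
commands = classifier_commands(
    "silva132_99_16S.fna", "taxonomy_7_levels.txt",
    "FORWARD_PRIMER", "REVERSE_PRIMER", "rep-seqs.qza")

for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```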

I'm comparing the same data to a 99% SILVA128 classifier I made, and there are some differences, namely in the classification of archaea. But the SILVA128 classifier was made with a majority taxonomy map, so it's not quite like-for-like.
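A comparison like this can be roughly quantified by counting, per domain, how many features receive a different taxonomy string from the two classifiers. The sketch below assumes each classification has already been loaded into a dict of feature ID to semicolon-delimited taxonomy string; the function names and toy assignments are hypothetical.

```python
from collections import Counter


def domain_of(taxon):
    """Return the first (domain) level of a semicolon-delimited taxonomy."""
    return taxon.split(";")[0].strip()


def disagreements_by_domain(assign_a, assign_b):
    """Count features whose taxonomy differs, keyed by the domain in assign_a."""
    counts = Counter()
    for feature_id, taxon_a in assign_a.items():
        taxon_b = assign_b.get(feature_id)
        if taxon_b is not None and taxon_a != taxon_b:
            counts[domain_of(taxon_a)] += 1
    return counts


# Hypothetical toy assignments from two classifiers for the same features.
silva132 = {
    "f1": "D_0__Bacteria;D_1__Firmicutes",
    "f2": "D_0__Archaea;D_1__Euryarchaeota",
    "f3": "D_0__Archaea;D_1__Thaumarchaeota",
}
silva128 = {
    "f1": "D_0__Bacteria;D_1__Firmicutes",
    "f2": "D_0__Archaea;D_1__Crenarchaeota",
    "f3": "D_0__Archaea;D_1__Thaumarchaeota",
}

print(disagreements_by_domain(silva132, silva128))
```

Exact string comparison is deliberately strict, so differences in how the two taxonomy maps were built (for example majority versus unparsed) will also show up as disagreements.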

The author of the SILVA128 notes and the authors of the scripts have made it possible for me to get to this stage, but even then it's not been easy for an amateur like me! Now that I have my head around the principles of the task, it seems pretty straightforward. Perhaps just a more detailed walkthrough would suffice. It's a job that only really needs doing once, and the folks who have done it previously have saved us more trouble than I think people like me realised (as is always the way!). All the scripts used are on GitHub, so maybe it's just a case of making them a plugin?

colinbrislawn · December 17, 2017, 12:16am

Hey Peter,

Thanks for tackling this! I'm glad folks are making use of multiple databases.

I want to cc William A Walters (Tony), who wrote that guide and all those scripts. Maybe he has some advice, or an update on SILVA compatibility.

Colin

matpich · December 18, 2017, 3:13pm

Hi @ebolyen,
Congrats on your successful install @peterleary.
I am following the same path as @peterleary, but I am quite reluctant to install qiime1 for the few steps that rely on it. Yet I really appreciate the benefit of the consensus-based annotation that these steps help obtain.
Before I do something I could regret all my life (i.e. install Qiime1), could you let me know if there is a plan to release a Qiime2-ready SILVA132 db sometime soon?
Many thanks!

William · December 18, 2017, 3:16pm

Hello Peter,

I'm glad the database creation notes were helpful. Unfortunately, I haven't been active in the QIIME 2 development side of things. I'll take a look and see if my scripts (and Mike Robeson's, which handle a lot of the taxonomy parsing) can be readily modified to a plugin.

The parsing to 7 levels applies to the eukaryotic side of taxonomy; I don't think I've seen any archaea/bacteria with more than 7 levels. Does the script throw an error when you try to create the 7-level taxonomy? If not, what do the input and output taxonomy strings look like?

peterleary · December 18, 2017, 5:41pm

Hi Tony,

Thanks for replying!

With regards to parsing to 7 levels: I was trying to parse the complete taxonomy map, including 18S. I did look at the 16S-only file and as far as I could see, they all only had 7 levels. I'm not looking for 18S. But someone will need the 18S, so I've been trying again.

I follow the instructions from "Taxonomy mapping file creation", including: 1. prep_silva_data.py; 2. parse_nonstandard_chars.py; 3. then when I go to use parse_to_7_taxa_levels.py, I get the error:

File "/Users/peterleary/pprospector/bin/parse_to_7_taxa_levels.py", line 30, in <module>
    last_named_level = taxa[taxa_depth - 1].split('__')[1]
IndexError: list index out of range

This environment is running Python 2.7, FWIW.

I'm also struggling to create a consensus taxonomy file, but I suspect I am just inputting the wrong files. Are you able to clarify exactly what files are required as inputs, please?

Thanks very much!
Peter

gregcaporaso · December 18, 2017, 6:28pm

Hi all,
A few comments on various parts of this conversation:

We would like to, but on the QIIME 2 side we do rely on those QIIME-compatible Silva files that @William has typically been responsible for creating.

@William, I would be happy to try to help you and @SoilRotifer with this. It would be great if we could figure out how to automate this process, which would be easier if we had this functionality accessible in a QIIME 2 plugin.

As a side note, if your current processing workflow requires QIIME 1 for some steps, it's possible to install QIIME 1 and QIIME 2 side-by-side in different conda environments. If you already have QIIME 2 installed, you should be able to follow the QIIME 1 conda installation steps to get a working QIIME 1 base install. If the steps you need from QIIME 1 are focused on OTU clustering, QIIME 2 does now offer similar functionality (see here).

Please let me know what I can do to help in this process. It would be really great to get some trained classifiers for Silva 132 posted for use with QIIME 2.

William · December 18, 2017, 7:35pm

Hello Peter,

I've also emailed the maintainers of SILVA about this, as they had wanted to take over the creation of the QIIME-compatible databases at one point (although the response might be delayed at this point due to the holidays).

In any case, I think the order should be this:
prep_silva_data.py (on the 132 fasta file from SILVA that has taxonomy strings in the labels).
Then, the output taxonomy file from this can first be checked for non-ASCII or * characters with parse_nonstandard_chars.py, and then the cleaned output of this is used as input for:
prep_silva_taxonomy_file.py
The file from prep_silva_taxonomy_file.py should have the number of levels equal to the maximum present in the SILVA taxonomy. This file is used as input to parse_to_7_taxa_levels.py, which shouldn't change the archaea/bacteria taxonomies (apart from cutting off the empty levels after the 7th), but should change a lot of the eukaryotic ones.

Which file did you use as input for parse_to_7_taxa_levels.py? You might have used the output from prep_silva_data.py, rather than the downstream one from prep_silva_taxonomy_file.py.
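A quick pre-flight check can tell the two candidate files apart before running the 7-level script. The sketch below assumes that the file from prep_silva_taxonomy_file.py writes each level with a `D_<n>__` prefix (the `D_0__` prefix is mentioned later in this thread) and that the map is tab-delimited with the sequence ID first. The exact output formats of the real scripts are not reproduced here, and the helper name is hypothetical. The first lines show how an expression like the one in the traceback raises IndexError on a level with no `__` separator.

```python
import re

# Assumed prefix format: D_0__, D_1__, ... at the start of every level.
LEVEL_PREFIX = re.compile(r"^D_\d+__")

# Same shape as the failing line in the traceback: no '__' means no index 1.
try:
    "Bacteria".split("__")[1]
except IndexError as err:
    print("unprefixed level fails:", err)


def unprefixed_lines(taxonomy_lines):
    """Return (line_number, seq_id) for rows where a level lacks the prefix."""
    bad = []
    for number, line in enumerate(taxonomy_lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            bad.append((number, fields[0]))
            continue
        levels = [lvl.strip() for lvl in fields[1].split(";") if lvl.strip()]
        if not levels or not all(LEVEL_PREFIX.match(lvl) for lvl in levels):
            bad.append((number, fields[0]))
    return bad


# Hypothetical rows: the first is prefixed, the second is not.
rows = [
    "seq1\tD_0__Bacteria;D_1__Proteobacteria\n",
    "seq2\tBacteria;Proteobacteria\n",
]
print(unprefixed_lines(rows))
```

If every row is reported, the file is most likely the earlier, unprefixed taxonomy output rather than the one the 7-level script expects.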

The creation of the majority and consensus taxonomy files utilizes the following files (listed in order of input, and it's the same input files for each of the scripts):

1. The final (either full or 7 level) taxonomy mapping file created from the previously discussed step.
2. The representative sequence set (i.e., when creating 99%, 97%, etc. reference datasets, the fasta file created with pick_rep_set.py on these results; the consensus/majority steps assume all of the OTU picking and creation of rep set files has already been done). This is just used to get the order of the labels, so that, for convenience, the taxonomy file will be in the same order as the fasta sequence file.
3. The OTU mapping file. This might be creating the confusion: it's the .txt file that is created when running pick_otus.py that has the tab-delimited OTU identifier and the identifiers of all sequences that fell into that OTU. The exact name of the file depends upon the name of the input file, but it will be in the output folder of pick_otus.py so there's not many files one has to search through to find it.
4. The output taxonomy file, consensus or majority.
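To illustrate the idea behind these inputs, here is a minimal sketch of how the taxonomies of all sequences in one OTU might be collapsed into a single string. It is not the actual script: the `min_fraction` threshold and the `Ambiguous_taxa` filler are placeholders chosen for illustration. With `min_fraction=1.0` every member must share the lineage (a consensus-style rule); a lower value gives a majority-style rule.

```python
from collections import Counter


def collapse_taxonomy(member_taxa, min_fraction=1.0, ambiguous="Ambiguous_taxa"):
    """Collapse the taxonomy strings of one OTU's members into one string.

    A level is kept while at least `min_fraction` of members share the whole
    lineage down to that level; all deeper levels are filled with `ambiguous`.
    Assumes `member_taxa` is non-empty.
    """
    split = [[lvl.strip() for lvl in t.split(";")] for t in member_taxa]
    depth = max(len(levels) for levels in split)
    result = []
    agreed = True
    for i in range(depth):
        # Compare whole lineages down to level i, not just the name at level i.
        prefixes = [tuple(levels[: i + 1]) for levels in split if len(levels) > i]
        best, count = Counter(prefixes).most_common(1)[0]
        if agreed and count >= min_fraction * len(split):
            result.append(best[-1])
        else:
            agreed = False
            result.append(ambiguous)
    return ";".join(result)


# Hypothetical members of a single OTU.
members = [
    "D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli",
    "D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia",
    "D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli",
]

print(collapse_taxonomy(members))                    # consensus-style
print(collapse_taxonomy(members, min_fraction=0.6))  # majority-style
```

The OTU mapping file supplies the member lists, the taxonomy mapping file supplies each member's string, and the representative sequence set only fixes the order in which the collapsed strings are written out.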

Thanks for spending the time on this, Peter. We probably should work on setting up an automated workflow for creating and testing these (determining the memory usage for taxonomic classifiers/OTU picking would be helpful for users as well), although we do have to make sure it's SILVA that hosts whatever the derived files are. They are serious about maintaining control of the data since it's only free for academic use, and they understandably want people to cite their work.

-Tony

SoilRotifer · December 19, 2017, 6:10pm

Hi @William & @gregcaporaso, I think having a plugin to ease the process of generating QIIME-compatible SILVA files would be ideal.

I'm glad to see how far along @peterleary has been able to get in generating a SILVA 132 DB. Nice work!

A while ago I tried my best to combine both @William's and my notes here, along with some links to simple tricks and related QIIME1 posts. But much of this could be more streamlined. I'd be happy to help.

-Mike

peterleary · January 5, 2018, 8:40pm

Hi all,

Happy new year!

Thanks for the really helpful and interesting replies to this. It seems like a plugin is a popular idea. Unfortunately, I'm very much an end user rather than a developer, so I won't be much use.

@William – I followed your instructions and I managed to create a consensus taxonomy file. As is always the case, the fault was with me, the user, for failing to read the Notes file properly, and missing the prep_silva_taxonomy_file.py step. But also your numbered points on creating the consensus map were enormously helpful so thank you very much for taking the time to write it. And of course thank you to @SoilRotifer and @William for the scripts!

Something I did notice however, is that when I separated the taxonomy by domain, I lost the D_0__ from the prep_silva_taxonomy_file step, so had to put the separated taxonomy maps through that again.
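One way to avoid losing the prefix is to split the already-prefixed taxonomy map by domain while leaving every row untouched. The sketch below assumes a tab-delimited map with strings such as `D_0__Bacteria;D_1__...`; the function name and toy rows are hypothetical, and this is not how the original scripts do the split.

```python
from collections import defaultdict


def split_by_domain(taxonomy_lines):
    """Group taxonomy map rows by domain, leaving every row unchanged."""
    groups = defaultdict(list)
    for line in taxonomy_lines:
        row = line.rstrip("\n")
        if not row:
            continue
        seq_id, _, taxonomy = row.partition("\t")
        first_level = taxonomy.split(";")[0].strip()
        # Strip the prefix only to build the grouping key, never from the row.
        domain = first_level.split("__", 1)[-1]
        groups[domain].append(row)
    return dict(groups)


# Hypothetical rows from a prefixed taxonomy map.
rows = [
    "seq1\tD_0__Bacteria;D_1__Proteobacteria\n",
    "seq2\tD_0__Archaea;D_1__Euryarchaeota\n",
    "seq3\tD_0__Eukaryota;D_1__Opisthokonta\n",
]

for domain, domain_rows in sorted(split_by_domain(rows).items()):
    print(domain, domain_rows)
```

Each group can then be written to its own file with the `D_0__` prefixes still in place.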

@SoilRotifer – Your notes look very useful. I'll probably make my own, similar, notes based on everything, hopefully in a format well suited to folks like me who are very, very new to this stuff!

This is presumably going to be an issue every time SILVA release a new database, and while I've now successfully navigated my way through @William and @SoilRotifer's notes, we'll have to go through all this again in 6-12 months! I suppose, ideally, QIIME would have the plugin and SILVA prepare/host the files? Would QIIME be able to host some pre-prepared classifiers say for the 16S V4?

Thanks again, everyone!
Peter

thermokarst · January 9, 2018, 3:57pm

Hello @peterleary!

Yep! I think the last few times we have built out the trained classifiers it has been an oversight on our part not to update to the latest SILVA release. Stay tuned for more up-to-date revisions in the near future! I think we will continue to provide the Full DB, as well as the 515 806 region, too. Thanks!

system · February 9, 2018, 10:08pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.