Hello Peter,
I’ve also emailed the maintainers of SILVA about this, since at one point they wanted to take over the creation of the QIIME-compatible databases (although their response might be delayed because of the holidays).
In any case, I think the order should be this:
prep_silva_data.py (on the 132 fasta file from SILVA that has taxonomy strings in the labels).
Then, the output taxonomy file from this can first be checked for non-ASCII or * characters with parse_nonstandard_chars.py, and then the cleaned output of this is used as input for:
prep_silva_taxonomy_file.py
The file from prep_silva_taxonomy_file.py should have the number of levels equal to the maximum present in the SILVA taxonomy. This file is used as input to parse_to_7_taxa_levels.py, which shouldn’t change the archaea/bacteria taxonomies (apart from cutting off the empty levels after the 7th), but should change a lot of the eukaryotic ones.
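As a rough illustration of the character check in the second step, here is a minimal Python sketch that flags lines containing non-ASCII characters or an asterisk. It is not the code from parse_nonstandard_chars.py; the function name and the example lines are made up for illustration, and it only reports the offending lines rather than cleaning them.

```python
def find_nonstandard_lines(lines):
    """Return (line_number, line) pairs that contain non-ASCII characters or '*'."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        has_non_ascii = any(ord(ch) > 127 for ch in line)
        if has_non_ascii or "*" in line:
            flagged.append((number, line.rstrip("\n")))
    return flagged


# Hypothetical taxonomy lines: ID, a tab, then the taxonomy string.
example = [
    "AB001\tBacteria;Proteobacteria\n",
    "AB002\tEukaryota;Fungi*\n",
    "AB003\tEukaryota;Caf\u00e9\n",
]

for number, line in find_nonstandard_lines(example):
    print(number, line)
```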
Which file did you use as input for parse_to_7_taxa_levels.py? You might have used the output from prep_silva_data.py, rather than the downstream one from prep_silva_taxonomy_file.py.
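One quick way to tell the two files apart might be to count how many levels each taxonomy string has. The sketch below assumes each line is a sequence ID, a tab, and a semicolon-separated taxonomy string; that layout and the function name are my assumptions, not something taken from the scripts. If every line reports the same number of levels, the file is probably the one from prep_silva_taxonomy_file.py; a mix of depths suggests the earlier output.

```python
from collections import Counter


def level_counts(lines, sep=";"):
    """Count how many taxonomy strings have each number of levels."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        _seq_id, _, taxonomy = line.partition("\t")
        counts[len(taxonomy.split(sep))] += 1
    return counts


# Hypothetical lines with uneven depth, as raw SILVA-style strings might have.
example = [
    "A\tBacteria;Proteobacteria;Gammaproteobacteria\n",
    "B\tEukaryota;Opisthokonta\n",
    "C\tArchaea;Euryarchaeota;Halobacteria\n",
]

for depth, n in sorted(level_counts(example).items()):
    print(depth, "levels:", n, "lines")
```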
The creation of the majority and consensus taxonomy files utilizes the following files (listed in order of input, and it’s the same input files for each of the scripts):
- The final (either full or 7 level) taxonomy mapping file created from the previously discussed step.
- The representative sequence set (i.e., when creating 99%, 97%, etc. reference datasets, the fasta file created by running pick_rep_set.py on those results; the consensus/majority steps assume all of the OTU picking and creation of rep set files has already been done). This is just used to get the order of the labels, so that, for convenience, the taxonomy file will be in the same order as the fasta sequence file.
- The OTU mapping file. This might be what is causing the confusion: it’s the .txt file created when running pick_otus.py, with the tab-delimited OTU identifier followed by the identifiers of all sequences that fell into that OTU. The exact name of the file depends upon the name of the input file, but it will be in the output folder of pick_otus.py, so there aren’t many files one has to search through to find it.
- The output taxonomy file, consensus or majority.
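To make the roles of these inputs concrete, here is a small Python sketch of how they could fit together. It is not the code from the actual consensus/majority scripts: the function names, the "Ambiguous_taxa" placeholder, and the 0.5 majority threshold are all assumptions for illustration. It also assumes every taxonomy string has the same number of levels, and it takes the rep set order as a plain list of OTU identifiers rather than parsing the fasta file.

```python
from collections import Counter

AMBIGUOUS = "Ambiguous_taxa"  # hypothetical placeholder label


def read_taxonomy(lines):
    """Map sequence ID -> list of levels (ID, tab, level1;level2;...)."""
    taxonomy = {}
    for line in lines:
        seq_id, _, lineage = line.rstrip("\n").partition("\t")
        taxonomy[seq_id] = lineage.split(";")
    return taxonomy


def read_otu_map(lines):
    """Map OTU ID -> member sequence IDs (tab-delimited, OTU ID first)."""
    otu_map = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        otu_map[fields[0]] = fields[1:]
    return otu_map


def consensus(lineages):
    """Keep a level only if every member sequence agrees on it."""
    return [names[0] if len(set(names)) == 1 else AMBIGUOUS
            for names in zip(*lineages)]


def majority(lineages, threshold=0.5):
    """Keep the most common name at a level if its share exceeds threshold."""
    result = []
    for names in zip(*lineages):
        name, count = Counter(names).most_common(1)[0]
        result.append(name if count / len(names) > threshold else AMBIGUOUS)
    return result


def summarize(otu_order, otu_map, taxonomy, method):
    """Build one 'OTU ID, tab, taxonomy' row per OTU, in rep set order."""
    rows = []
    for otu_id in otu_order:
        lineages = [taxonomy[seq_id] for seq_id in otu_map[otu_id]]
        rows.append(otu_id + "\t" + ";".join(method(lineages)))
    return rows


# Hypothetical inputs.
taxonomy = read_taxonomy([
    "s1\tBacteria;Firmicutes",
    "s2\tBacteria;Firmicutes",
    "s3\tBacteria;Proteobacteria",
    "s4\tArchaea;Euryarchaeota",
])
otu_map = read_otu_map(["otu0\ts1\ts2\ts3", "otu1\ts4"])
otu_order = ["otu1", "otu0"]  # order of labels in the rep set fasta

print("\n".join(summarize(otu_order, otu_map, taxonomy, consensus)))
print("\n".join(summarize(otu_order, otu_map, taxonomy, majority)))
```

In this toy example the two methods differ only for otu0: the consensus version marks the second level as ambiguous because s3 disagrees, while the majority version keeps Firmicutes because two of the three sequences carry it.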
Thanks for spending the time on this, Peter. We should probably work on setting up an automated workflow for creating and testing these (determining the memory usage for taxonomic classifiers/OTU picking would be helpful for users as well). We do have to make sure it’s SILVA that hosts whatever the derived files are, though: they are serious about maintaining control of the data since it’s only free for academic use, and they understandably want people to cite their work.
-Tony