I'm trying to understand which files in the SILVA 132 QIIME-compatible database (i.e. the Silva_132_release.zip directory available for download here) can be used for taxonomic classification of 16S rRNA ASVs in QIIME 2, and also how those files were constructed from the raw SILVA data. (If anyone's wondering, I know that 132 isn't the most recent release of SILVA, but it's the release I happen to be interested in.)
The Silva_132_release/SILVA_132_QIIME_release directory contains the following subdirectories:
It also contains a Silva_132_notes.txt file, which supposedly explains all of these subdirectories and their files, but I'm quite new to bioinformatics, so a lot of the explanations are over my head.
The feature-classifier tutorial states that two elements are required for training a classifier using the feature-classifier plugin: "reference sequences" and "the corresponding taxonomic classifications".
Based on what @jairideout says in this forum post about Greengenes, it would seem that the rep_set and taxonomy directories are the directories I need for taxonomic classification (i.e. rep_set for the reference sequences, and taxonomy for the corresponding taxonomic classifications). And because I'm only interested in 16S sequences, I assume I need the rep_set/rep_set_16S_only directory and the taxonomy/16S_only directory. Is this correct?
Both the rep_set/rep_set_16S_only and taxonomy/16S_only directories have the following subdirectories:
What follows is my best attempt to understand very broadly how the files contained in these subdirectories were prepared, based on the contents of the Silva_132_notes.txt file. If someone could check my understanding and correct any errors, I would be very grateful!
The 'rep_set' subdirectories:
Each of the four rep_set subdirectories contains a single file: silva_132_90_16S.fna, silva_132_94_16S.fna, silva_132_97_16S.fna and silva_132_99_16S.fna, respectively.
If I understand correctly, the creation of these .fna files is primarily described in the "Filtering raw fasta file, creation of representative sequence files" section of the Silva_132_notes.txt file.
Here are some fragmented quotes from the Silva_132_notes.txt file:
- "The full aligned SSU sequence from Silva with taxonomy strings in the fasta comments was downloaded..."
- "...2090668 sequences..."
- "...convert U characters to T characters, and remove gaps..."
- "...1710544 [sequences] after dereplication..."
- "...sorted by length..."
- "...clustering at 99%, 97%, 94%, 90%, and 80% identities..."
- "Total number of sequences (all domains) for each clustering identity:
- "Splitting ... by domain"
My interpretation (please correct if wrong):
Wikipedia tells me that SILVA is a database of both small subunit (SSU; 16S/18S) and large subunit (LSU; 23S/28S) ribosomal RNA (rRNA) sequences. In preparing the SILVA 132 QIIME-compatible database, full-length aligned 16S and 18S rRNA sequences – each labelled as belonging to a particular taxon – were downloaded from SILVA. All uracil (U) characters were converted to thymine (T) characters (i.e. the rRNA sequences were converted to DNA sequences), and alignment gaps were removed. The sequences were dereplicated (i.e. exact duplicates were collapsed so that each unique sequence appears only once). The remaining sequences were sorted by length and clustered at 99%, 97%, 94%, 90%, and 80% sequence identity. Finally, the clustered sequences were split by domain, so that the 16S sequences and 18S sequences were separated from one another.
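To make my interpretation concrete, here is a minimal Python sketch of the filtering steps as I understand them. It is not the script used to build the release. The function name `preprocess`, the toy records, the treatment of both `-` and `.` as gap characters, and the longest-first sort order are all my own assumptions (the notes only say "sorted by length").

```python
def preprocess(records):
    """records: iterable of (header, aligned_rna_sequence) pairs."""
    seen = set()
    unique = []
    for header, seq in records:
        # Convert the RNA alphabet to DNA and strip gap characters
        # (assumption: gaps are written as '-' or '.').
        dna = seq.upper().replace("U", "T").replace("-", "").replace(".", "")
        # Dereplicate: keep only the first record for each distinct sequence.
        if dna not in seen:
            seen.add(dna)
            unique.append((header, dna))
    # Assumption: longest sequences first.
    return sorted(unique, key=lambda rec: len(rec[1]), reverse=True)


# Hypothetical toy input: seq1 and seq2 are identical once gaps are removed.
toy = [
    ("seq1", "AU-GC..U"),
    ("seq2", "AUGCU"),
    ("seq3", "AUGCUUA"),
]
result = preprocess(toy)
print(result)
```

In this toy example seq2 disappears because it duplicates seq1 after gap removal, which is what I take "dereplication" to mean.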
If this is roughly accurate, then are the .fna files in the subdirectories of rep_set/rep_set_16S_only basically lists of unique representative sequences from OTUs generated by clustering the processed SILVA sequences at different percentages of sequence identity?
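To illustrate what I mean by "representative sequences", here is a toy greedy clustering sketch in Python. The quotes above do not say which tool or algorithm was actually used, so this is only my mental model: the first (longest) unassigned sequence becomes the representative of a new cluster, and later sequences join the first representative they match at or above the threshold. The `identity` function is a crude stand-in that compares characters position by position without any alignment, and all names and sequences are hypothetical.

```python
def identity(a, b):
    """Crude identity: matching positions divided by the shorter length (no alignment)."""
    shorter = min(len(a), len(b))
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / shorter


def greedy_cluster(records, threshold):
    """records: (seq_id, sequence) pairs, assumed already sorted longest first."""
    clusters = {}  # representative id -> list of member ids
    reps = []      # (representative id, representative sequence)
    for seq_id, seq in records:
        for rep_id, rep_seq in reps:
            if identity(seq, rep_seq) >= threshold:
                clusters[rep_id].append(seq_id)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            reps.append((seq_id, seq))
            clusters[seq_id] = [seq_id]
    return clusters


# Hypothetical toy sequences: b differs from a at one of ten positions.
toy_seqs = [
    ("a", "ACGTACGTAC"),
    ("b", "ACGTACGTAA"),
    ("c", "TTTTTTTTTT"),
]
clusters_90 = greedy_cluster(toy_seqs, 0.90)
clusters_99 = greedy_cluster(toy_seqs, 0.99)
print(clusters_90)
print(clusters_99)
```

In this sketch, a lower threshold gives fewer, broader clusters, and only the representative's ID and sequence would end up in the .fna file.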
The pretrained Naive Bayes classifiers on the QIIME 2 'Data resources' page seem to have been trained on 99% OTUs, and @Nicholas_Bokulich seems to recommend the use of 99% OTUs in this forum post, so it seems like I should use silva_132_99_16S.fna as the source of reference sequences for training the classifier. But is this okay even if I want to classify ASVs derived from DADA2 output?
Also, why must the SILVA sequences be clustered at all? Why not just use the sequences as they are?
If I understand correctly, the raw sequences downloaded from SILVA have taxonomy strings. For example, if I peek at the description lines (i.e. lines starting with '>') in the raw_data/initial_reads_SILVA132.fna file, I see ...
>GY187501.2.1421 Bacteria;Epsilonbacteraeota;Campylobacteria;Campylobacterales;Helicobacteraceae;Helicobacter;unidentified
>GY194060.4884.6412 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;unidentified
>AC201869.46386.47908 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Candidatus Regiella;Candidatus Regiella insecticola
But if I peek at the description lines in the rep_set/rep_set_16S_only/99/silva_132_99_16S.fna file, I see ...
>AB302407.1.2962
>KU725476.45629.48552
>KU725475.45598.48520
In other words, the taxonomy strings have been removed from the sequences. Is there a reason for this?
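For anyone who wants to see the difference programmatically, here is a small Python sketch that splits a description line into a sequence ID and a list of taxonomy levels. The function name `split_header` is my own, and I am assuming that the ID is everything before the first space and that taxonomy levels are separated by semicolons, as in the lines shown above.

```python
def split_header(line):
    """Split a FASTA description line into (sequence_id, taxonomy_levels)."""
    body = line.lstrip(">").strip()
    # Assumption: the ID ends at the first space; the rest is the taxonomy string.
    seq_id, _, taxonomy = body.partition(" ")
    levels = taxonomy.split(";") if taxonomy else []
    return seq_id, levels


raw_line = (
    ">AC201869.46386.47908 Bacteria;Proteobacteria;Gammaproteobacteria;"
    "Enterobacteriales;Enterobacteriaceae;Candidatus Regiella;"
    "Candidatus Regiella insecticola"
)
rep_line = ">AB302407.1.2962"

raw_id, raw_levels = split_header(raw_line)
rep_id, rep_levels = split_header(rep_line)
print(raw_id, len(raw_levels), raw_levels[-1])
print(rep_id, rep_levels)
```

The raw line yields seven taxonomy levels, while the rep_set line yields only an ID and an empty list.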
One possible explanation I can think of is that if the raw_data/initial_reads_SILVA132.fna sequences are clustered into OTUs in the rep_set/rep_set_16S_only/99/silva_132_99_16S.fna file, and the names of the representative OTU sequences are somehow derived from the raw data (which they seem to be), then it makes sense to remove the taxonomy strings. For example, the raw sequence named >AB302407.1.2962 in the raw_data/initial_reads_SILVA132.fna file might have had a taxonomy string of ...
>AB302407.1.2962 Archaea;Crenarchaeota;Thermoprotei;Thermoproteales;Thermoproteaceae;Pyrobaculum;Pyrobaculum sp. M0H
... the representative OTU sequence named >AB302407.1.2962 in the rep_set/rep_set_16S_only/99/silva_132_99_16S.fna file could perhaps represent several raw sequences (if I'm not mistaken), and maybe not all of those raw sequences had the exact same taxonomic classification in the raw data. Is that about right?
The 'taxonomy/16S_only' subdirectories:
Each of the taxonomy/16S_only subdirectories contains seven files named as follows:
In the Silva_132_notes.txt file, there is a section on Consensus and Majority Taxonomies, which specifies the difference between these two:
A user of the Silva119 data pointed out that the taxonomy with the SILVA119 release is based only upon the taxonomy string of the representative sequence for the cluster of reads, which could lead to incorrect confidence in taxonomy assignments at the fine level (genus/species). To address this, I have endeavoured to create taxonomy strings that are either consensus (all taxa strings must match for every read that fell into the cluster) or majority (greater than or equal to 90% of the taxonomy strings for a given cluster). If a taxonomy string fails to be consensus or majority, then it becomes ambiguous, moving up the levels of taxonomy until consensus/majority taxonomy strings are met.
- "...consensus (all taxa strings must match for every read that fell into the cluster) ..."
- "... majority (greater than or equal to 90% of the taxonomy strings for a given cluster) ..."
I think I understand the difference – assuming the word 'cluster' here corresponds to the same process of generating OTUs by which the rep_set subdirectories were prepared. Does it?
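Here is how I picture the consensus/majority rule, as a minimal Python sketch. It is not the code used to build the release. The function name `cluster_taxonomy`, the `min_fraction` parameter, and the toy cluster are hypothetical, and I simply truncate the string at the first level that fails the rule, whereas the real files may label ambiguous levels differently.

```python
from collections import Counter


def cluster_taxonomy(tax_strings, min_fraction=1.0):
    """Return the deepest taxonomy prefix shared by at least min_fraction of a cluster.

    min_fraction=1.0 corresponds to "consensus", 0.9 to "majority".
    """
    split = [t.split(";") for t in tax_strings]
    depth = min(len(levels) for levels in split)
    agreed = []
    for i in range(depth):
        # Count whole prefixes up to this level, so agreement is cumulative.
        prefixes = Counter(tuple(levels[: i + 1]) for levels in split)
        prefix, count = prefixes.most_common(1)[0]
        if count / len(split) >= min_fraction:
            agreed = list(prefix)
        else:
            # Assumption: stop at the first ambiguous level and keep the levels above it.
            break
    return ";".join(agreed)


# Hypothetical cluster: 9 of 10 members agree at the third level.
cluster = ["Bacteria;Firmicutes;Bacilli"] * 9 + ["Bacteria;Firmicutes;Clostridia"]
consensus = cluster_taxonomy(cluster, min_fraction=1.0)
majority = cluster_taxonomy(cluster, min_fraction=0.9)
print(consensus)
print(majority)
```

In this toy cluster the consensus string stops one level higher than the majority string, because a single disagreeing member is enough to break consensus but not the 90% majority.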
Apologies for the ungodly length of this post, and the potentially very noob questions. Perhaps this kind of 'thinking out loud' post will be somehow useful to other noobs in the future. Anyway, big thanks to the team for developing this amazing tool and for maintaining this very useful forum. You are appreciated, and we are grateful! Keep up the great work!