Questions about the SILVA 132 QIIME-compatible database, and using it for taxonomic classification of 16S rRNA amplicons

Hi there,

I'm trying to understand which files in the SILVA 132 QIIME-compatible database (i.e. the directory available for download here) can be used for taxonomic classification of 16S rRNA ASVs in QIIME 2, and also how those files were constructed from the raw SILVA data. (If anyone's wondering, I know that 132 isn't the most recent release of SILVA, but it's the release I happen to be interested in.)

The Silva_132_release/SILVA_132_QIIME_release directory contains the following subdirectories:

  • core_alignment
  • raw_data
  • rep_set
  • rep_set_aligned
  • taxonomy
  • trees

It also contains a Silva_132_notes.txt file, which supposedly explains all of these subdirectories and their files, but I'm quite new to bioinformatics, so a lot of the explanations are over my head.

The feature-classifier tutorial states that two elements are required for training a classifier using the feature-classifier plugin: "reference sequences" and "the corresponding taxonomic classifications".

Based on what @jairideout says in this forum post about Greengenes, it would seem that the rep_set and taxonomy directories are the directories I need for taxonomic classification (i.e. rep_set for the reference sequences, and taxonomy for the corresponding taxonomic classifications). And because I'm only interested in 16S sequences, I assume I need the rep_set/rep_set_16S_only directory and the taxonomy/16S_only directory. Is this correct?

Both the rep_set/rep_set_16S_only and taxonomy/16S_only directories have the following subdirectories:

  • 90
  • 94
  • 97
  • 99

What follows is my best attempt to understand very broadly how the files contained in these subdirectories were prepared, based on the contents of the Silva_132_notes.txt file. If someone could check my understanding and correct any errors, I would be very grateful!

The 'rep_set' subdirectories:

Each of the four rep_set subdirectories contains a single file: silva_132_90_16S.fna, silva_132_94_16S.fna, silva_132_97_16S.fna and silva_132_99_16S.fna, respectively.

If I understand correctly, the creation of these .fna files is primarily described in the Filtering raw fasta file, creation of representative sequence files section of the Silva_132_notes.txt file.

Here are some fragmented quotes from the Silva_132_notes.txt file:

  • "The full aligned SSU sequence from Silva with taxonomy strings in the fasta comments was downloaded..."
  • "...2090668 sequences..."
  • "...convert U characters to T characters, and remove gaps..."
  • "...dereplicated..."
  • "...1710544 [sequences] after dereplication..."
  • "...sorted by length..."
  • "...clustering at 99%, 97%, 94%, 90%, and 80% identities..."
  • "Total number of sequences (all domains) for each clustering identity:
    99% 412168
    97% 194822
    94% 94835
    90% 40215
    80% 5539"
  • "Splitting ... by domain"

My interpretation (please correct if wrong):

Wikipedia tells me that SILVA is a database of both small subunit (SSU; 16S/18S) and large subunit (LSU; 23S/28S) ribosomal RNA (rRNA) sequences. In preparing the SILVA 132 QIIME-compatible database, full 16S and 18S rRNA sequences – each labelled as belonging to a particular taxon – were downloaded from SILVA. All uracil (U) characters were converted to thymine (T) characters (i.e. the rRNA sequences were converted to DNA sequences), and gaps were removed. The sequences were dereplicated (i.e. duplicate sequences were collapsed so that only one copy of each unique sequence remained). The remaining sequences were sorted by length, and clustered at 99%, 97%, 94%, 90%, and 80% sequence identity. And finally, the 16S sequences and 18S sequences were separated from one another.
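As a sanity check of my own understanding, here's a toy sketch of the pre-clustering steps (this is just my illustration, not the actual SILVA pipeline code, and the sequence strings are made up):

```python
# Sketch of the preprocessing described above: convert RNA to DNA (U -> T),
# strip alignment gap characters, dereplicate, and sort by length.

def preprocess(seqs):
    # U -> T, and remove the gap characters ('-' and '.') used in alignments
    degapped = [s.replace("U", "T").replace("-", "").replace(".", "") for s in seqs]
    # dereplicate: keep one copy of each unique sequence, preserving order
    unique = list(dict.fromkeys(degapped))
    # sort by length, longest first
    return sorted(unique, key=len, reverse=True)

raw = ["AC-GU..AC", "ACGTAC", "ACG-UACG"]
print(preprocess(raw))  # ['ACGTACG', 'ACGTAC']
```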

If this is roughly accurate, then are the .fna files in the subdirectories of rep_set/rep_set_16S_only basically lists of unique representative sequences from OTUs generated by clustering the processed SILVA sequences at different percentages of sequence identity?

The pretrained Naive Bayes classifiers on the QIIME 2 'Data resources' page seem to have been trained on 99% OTUs, and @Nicholas_Bokulich seems to recommend the use of 99% OTUs in this forum post, so it seems like I should use the silva_132_99_16S.fna file as a source of reference sequences for training the classifier. But is this okay even if I want to classify ASVs derived from DADA2 output?

Also, why must the SILVA sequences be clustered at all? Why not just use the sequences as they are?

If I understand correctly, the raw sequences downloaded from SILVA have taxonomy strings. For example, if I peek at the description lines (i.e. lines starting with '>') in the raw_data/initial_reads_SILVA132.fna file, I see ...

>GY187501.2.1421 Bacteria;Epsilonbacteraeota;Campylobacteria;Campylobacterales;Helicobacteraceae;Helicobacter;unidentified
>GY194060.4884.6412 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;unidentified
>AC201869.46386.47908 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Candidatus Regiella;Candidatus Regiella insecticola
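Just to make sure I'm reading these headers correctly: each one seems to be an identifier followed by a semicolon-delimited taxonomy string, so presumably splitting them apart gives the two-column (id, taxonomy) layout seen in the taxonomy/*.txt files. A toy sketch of that split (my own illustration, not the actual release script):

```python
# Split a SILVA-style fasta header into (sequence id, taxonomy string).

def split_header(line):
    assert line.startswith(">")
    # the id runs up to the first space; the rest is the taxonomy string
    seq_id, _, taxonomy = line[1:].partition(" ")
    return seq_id, taxonomy

header = (">GY187501.2.1421 Bacteria;Epsilonbacteraeota;Campylobacteria;"
          "Campylobacterales;Helicobacteraceae;Helicobacter;unidentified")
seq_id, tax = split_header(header)
print(f"{seq_id}\t{tax}")
```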


But if I peek at the description lines in the rep_set/rep_set_16S_only/99/silva_132_99_16S.fna file, I see only bare sequence identifiers, e.g. ...

>AB302407.1.2962

In other words, the taxonomy strings have been removed from the sequences. Is there a reason for this?

One possible explanation I can think of is that if the raw_data/initial_reads_SILVA132.fna sequences are clustered into OTUs in the rep_set/rep_set_16S_only/99/silva_132_99_16S.fna file, and the names of the representative OTU sequences are somehow derived from the raw data (which they seem to be), then it makes sense to remove the taxonomy strings, because, for example, although the raw sequence named >AB302407.1.2962 in the raw_data/initial_reads_SILVA132.fna file might have had a taxonomic string of ...

>AB302407.1.2962 Archaea;Crenarchaeota;Thermoprotei;Thermoproteales;Thermoproteaceae;Pyrobaculum;Pyrobaculum sp. M0H

... the representative OTU sequence named >AB302407.1.2962 in the rep_set/rep_set_16S_only/99/silva_132_99_16S.fna file could perhaps represent several raw sequences (if I'm not mistaken), and maybe not all of those raw sequences had the exact same taxonomic classification in the raw data. Is that about right?

The 'taxonomy/16S_only' subdirectories:

Each of the taxonomy/16S_only subdirectories contains seven files named as follows:

  • consensus_taxonomy_7_levels.txt
  • consensus_taxonomy_all_levels.txt
  • majority_taxonomy_7_levels.txt
  • majority_taxonomy_all_levels.txt
  • raw_taxonomy.txt
  • taxonomy_7_levels.txt
  • taxonomy_all_levels.txt

Several forum posts seem to point to using either consensus_taxonomy_7_levels.txt or majority_taxonomy_7_levels.txt for taxonomic classification in QIIME 2.

In the Silva_132_notes.txt file, there is a section on Consensus and Majority Taxonomies, which specifies the difference between these two:

A user of the Silva119 data pointed out that the taxonomy with the SILVA119 release is based only upon the taxonomy string of the representative sequence for the cluster of reads, which could lead to incorrect confidence in taxonomy assignments at the fine level (genus/species). To address this, I have endeavoured to create taxonomy strings that are either consensus (all taxa strings must match for every read that fell into the cluster) or majority (greater than or equal to 90% of the taxonomy strings for a given cluster). If a taxonomy string fails to be consensus or majority, then it becomes ambiguous, moving up the levels of taxonomy until consensus/majority taxonomy strings are met.

In short:

  • "...consensus (all taxa strings must match for every read that fell into the cluster) ..."
  • "... majority (greater than or equal to 90% of the taxonomy strings for a given cluster) ..."

I think I understand the difference – assuming the word 'cluster' here corresponds to the same process of generating OTUs by which the rep_set subdirectories were prepared. Does it?

Apologies for the ungodly length of this post, and the potentially very noob questions. Perhaps this kind of 'thinking out loud' post will be somehow useful to other noobs in the future. Anyway, big thanks to the :qiime2: team for developing this amazing tool and for maintaining this very useful forum. You are appreciated, and we are grateful! Keep up the great work! :blush:



Great to see someone dive into the weeds! :herb:

As one of the contributors to this pipeline, let me see if I can help you out. :call_me_hand:

Correct. However, I'd recommend keeping all of the reference sequences, including the eukaryotes, in the files. This allows you to more robustly identify off-target reads, i.e. non-bacterial and non-archaeal sequences. The different versions of the database, e.g. 16S only, were made mostly for practical reasons: to reduce the memory and storage footprint of the reference database.


Looks like you have it all figured out. :thinking:

Yep! :dog2:

Yes. Clustering at 99% helps remove some extra noise in the reference data and reduces the size of the data set. We currently (see below) prefer to use the SILVA NR99 dataset for the reasons outlined here.

We'll come back to this. We have a treat for you at the end. :candy:
But, great question! You get a :star2:!
Again, traditionally, clustering was a way to remove noisy sequence data and reduce the size of the reference set used for taxonomic classification. Back in the day, many researchers did not have access to computing resources with the memory and CPU power needed to construct classifiers.
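If it helps to see the idea concretely, here is a toy sketch of length-sorted greedy clustering, in the spirit of tools like UCLUST or CD-HIT (not the actual tool used for the release, and with a deliberately naive position-wise identity measure):

```python
# Greedy clustering: each sequence joins the first existing seed it matches
# at >= the identity threshold; otherwise it becomes a new seed
# (i.e. a new cluster representative).

def identity(a, b):
    # naive position-wise identity over the shared prefix, for illustration only
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a[:n], b[:n])) / n

def greedy_cluster(seqs, threshold):
    seeds = []
    for s in sorted(seqs, key=len, reverse=True):  # longest sequences seed first
        for seed in seeds:
            if identity(s, seed) >= threshold:
                break  # absorbed into an existing cluster
        else:
            seeds.append(s)  # no seed matched: new representative
    return seeds

seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
print(greedy_cluster(seqs, 0.9))  # ['ACGTACGTAC', 'TTTTTTTTTT']
```

The second sequence differs from the first at only one of ten positions (90% identity), so it is absorbed; the third becomes its own representative.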

Just to reduce the file size. Once we have the taxonomy file, there is no need to keep that redundant information.

Correct. The original old code referenced in the README is here, and a description of the labelling format is available here.

However, we have since taken a different approach to parsing taxonomy. See the end of this post.


Here, the "cluster" is the set of sequences that fall into an OTU, i.e. that are represented by a single representative sequence. We form the consensus / majority taxonomy by taking into account all of the lineages that fall within that OTU / representative sequence, and collapsing them into a single taxonomy that is, hopefully, a good representation of all the sequences in that cluster.
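A toy sketch of that collapse (my own reimplementation of the idea, not the release pipeline; note that the real files label the ambiguous levels rather than simply truncating them, as the notes describe):

```python
# Walk the taxonomy levels from domain downward; keep a level only while the
# cluster's lineages agree (100% for consensus, >= 90% for majority), and
# stop at the first level where they do not.

from collections import Counter

def collapse(lineages, min_fraction):
    # lineages: one list of taxon names per sequence in the cluster
    out = []
    for level in zip(*lineages):  # iterate level by level across all lineages
        name, count = Counter(level).most_common(1)[0]
        if count / len(lineages) < min_fraction:
            break  # ambiguous from this level down
        out.append(name)
    return ";".join(out)

cluster = [
    ["Bacteria", "Firmicutes", "Bacilli"],
    ["Bacteria", "Firmicutes", "Clostridia"],
]
print(collapse(cluster, 1.0))  # consensus: 'Bacteria;Firmicutes'
print(collapse(cluster, 0.9))  # majority:  'Bacteria;Firmicutes'
```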

Not a problem. Not many have dug deep into the process of generating these files. I for one appreciate your interest! Thank you!

Okay... so that bit of :candy: I was promising... Instead of re-working your way through that old pipeline, you can simply make use of RESCRIPt to build your own SILVA reference database, even for version 132! You can work through the tutorials and curate the reference data the way you'd like. That is, you can simply run this command:

qiime rescript get-silva-data \
    --p-version 132 \
    --p-target SSURef_NR99 \
    --p-include-species-labels \
    --output-dir silva-132

and then follow the rest of the SILVA tutorial. I hope you'll find that it is superior to what we've done in the past, for example in how we parse taxonomy. Although we provide a few ready-to-use files and classifiers, you can certainly go ahead and make your own files, curated the way you'd like :robot:. The goal of RESCRIPt is to make life a little easier for those of us interested in constructing and curating our own little piece of reference database heaven. :cloud:

As a little history... the old SILVA parsing code, which I linked above, was eventually updated to this, and then found a home in RESCRIPt.

Take it for a spin and let us know how it works out.


Thanks so much for your detailed answer, @SoilRotifer. This was super useful!
