SILVA 16S database

I am trying to replicate another working group's taxonomic classification results for the V3 region of the 16S rRNA gene. They used QIIME 2 with SILVA as the reference database.

With kraken2 and the appropriate SILVA 16S database I obtain practically identical percentages. In QIIME 2, however, the "Silva 138 99% OTUs full-length sequences" classifier gives me completely different percentages, and the "Silva 138 99% OTUs from 515F/806R region of sequences" classifier gives me no classification at all.

I have seen that the SILVA database can be downloaded and customized with RESCRIPt; however, I do not have enough RAM to perform the operation.

What mistake could I be making?
Which database should I use?


A. C.

Hi @AndreaC, welcome to :qiime2:!

A couple of questions:

  1. Do you know if that working group trained their classifier specifically on the V3 region? This could be a potential reason for any differences observed.
  2. I assume they classified the same dataset as you are currently working with?
  3. Which QIIME 2 sequence identification tool did they use for classification? That is, did they use the naïve Bayes classifier, vsearch, or BLAST?

You are getting no classifications for your V3 16S rRNA amplicons because the Silva 515F/806R classifier is specific to the V4 region, which does not overlap at all with your V3 data.

Great to hear that you are trying RESCRIPt! If you have been dereplicating the data, especially after extracting your amplicon region of interest, you should not need too much RAM. How much RAM do you have?
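
To illustrate why extracting the amplicon region and then dereplicating shrinks the reference so much, here is a minimal Python sketch of the idea. It is not RESCRIPt's implementation: it only accepts exact (degeneracy-aware) primer matches, and the primers and sequences are made-up toy strings rather than real 16S primers or SILVA records.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regular expression."""
    return "".join(IUPAC[base] for base in primer.upper())


def reverse_complement(seq):
    """Reverse-complement a sequence, keeping IUPAC degenerate codes valid."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def extract_amplicon(seq, fwd, rev):
    """Return the region between the two primers, or None if either is absent."""
    # The reverse primer is written 5'->3' on the opposite strand, so we
    # search for its reverse complement on the reference strand.
    pattern = primer_regex(fwd) + "(.+?)" + primer_regex(reverse_complement(rev))
    match = re.search(pattern, seq.upper())
    return match.group(1) if match else None


def dereplicate(amplicons):
    """Collapse identical amplicons: sequence -> sorted list of reference IDs."""
    groups = {}
    for ref_id, amplicon in amplicons.items():
        groups.setdefault(amplicon, []).append(ref_id)
    return {amplicon: sorted(ids) for amplicon, ids in groups.items()}


# Hypothetical toy primers and reference sequences, for illustration only.
FWD, REV = "CCTAYG", "GGACTW"
references = {
    "ref1": "TTCCTACGAAAAGGGGAAGTCCCC",
    "ref2": "GGCCTATGAAAAGGGGTAGTCCAA",  # same amplicon as ref1
    "ref3": "CCTACGAAAACCCCAAGTCC",
    "ref4": "ACGTACGT",                  # primers not found, so it is dropped
}
extracted = {}
for ref_id, seq in references.items():
    amplicon = extract_amplicon(seq, FWD, REV)
    if amplicon is not None:
        extracted[ref_id] = amplicon
unique = dereplicate(extracted)
print(f"{len(references)} references -> {len(unique)} unique amplicons")
```

Real tools tolerate primer mismatches and decide how to reconcile the taxonomy of collapsed sequences, which this sketch ignores. The point is only that the classifier is then trained on far fewer and much shorter sequences.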


I am trying to replicate the analyses precisely because the materials and methods provided lack several details. To quote a movie: "I feel I was denied critical need-to-know information!" :sweat_smile:
I have the same .fastq they worked on, so I'm trying to find some results compatible with theirs.

Can the choice of naïve Bayes classifier vs. vsearch vs. BLAST explain a difference in the percentages of over 40%?

At the moment, unfortunately, I work with 32 GB of RAM. Following the instructions in the following guide, the procedure is killed after a couple of minutes, I guess for lack of RAM.

I tried to download the pre-trained classifier from this post, but it was trained with a version of scikit-learn that is too old to work with my QIIME 2 version.

Do I have to wait for my supervisor to provide the money for a hardware upgrade so I can train the classifier myself, or is there a working pre-trained SILVA 16S classifier? :laughing:

Thanks for your help


Hahaha! I feel your pain! :weary:

To be honest, I would not expect that drastic a difference between tools, unless they normalized the data (e.g. by rarefying) prior to making the plots.
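
If it helps to put a number on how far apart two taxonomic profiles are, one option is to convert both count tables to percentages at the same taxonomic rank and compare them taxon by taxon. The sketch below does that in plain Python, with Bray-Curtis dissimilarity as one possible summary. The counts are invented for illustration and are not from any real dataset.

```python
def relative_abundance(counts):
    """Convert raw read counts per taxon into percentages."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("profile has no counts")
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity: 0 means identical, 1 means no shared taxa."""
    taxa = set(p) | set(q)
    gap = sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in taxa)
    total = sum(p.get(t, 0.0) + q.get(t, 0.0) for t in taxa)
    return gap / total


def per_taxon_gap(p, q):
    """Percentage-point difference (p - q) per taxon, largest gaps first."""
    taxa = set(p) | set(q)
    gaps = {t: p.get(t, 0.0) - q.get(t, 0.0) for t in taxa}
    return sorted(gaps.items(), key=lambda item: (-abs(item[1]), item[0]))


# Made-up counts, for illustration only.
mine = relative_abundance(
    {"Bacteroidota": 500, "Firmicutes": 400, "Unassigned": 100}
)
theirs = relative_abundance(
    {"Bacteroidota": 300, "Firmicutes": 600, "Proteobacteria": 100}
)

print(f"Bray-Curtis dissimilarity: {bray_curtis(mine, theirs):.2f}")
for taxon, gap in per_taxon_gap(mine, theirs):
    print(f"{taxon}: {gap:+.1f} percentage points")
```

Looking at which taxa carry the largest gaps (for example, a large "Unassigned" fraction on one side only) can hint at whether the difference comes from the reference database, the classifier, or a normalization step.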

That is typically enough RAM for making amplicon-specific classifiers. For a full-length sequence classifier, that may be cutting it close. Also, several steps in the RESCRIPt pipeline can take a while, particularly building the classifier, which can take anywhere from a couple of hours to a couple of days depending on how large the database is. An amplicon-specific classifier should not need that much RAM, though it may still take several hours to construct.

You can simply download the classifier made for your version of :qiime2:. On any of the documentation pages you'll see a drop-down menu on the upper left. Select the version of :qiime2: you have, then download the classifier from the data resources page listed on the left. You may have to keep telling the website to take you to the content for that version and not the latest version of :qiime2:.
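
If you are unsure which release a downloaded classifier was built with, you can check before loading it. A `.qza` file is a zip archive, and to my knowledge it carries a small `VERSION` file under its top-level UUID directory that records the archive format and the QIIME 2 framework release. The sketch below reads that file with the standard library only; the archive layout is an assumption on my part, and `classifier.qza` in the usage comment is a placeholder file name.

```python
import zipfile


def read_qza_version(path):
    """Return the key/value pairs from the VERSION file of a QIIME 2 artifact."""
    with zipfile.ZipFile(path) as archive:
        # Assumed layout: everything lives under one top-level <uuid>/ directory,
        # so the file we want is <uuid>/VERSION.
        names = [
            name for name in archive.namelist()
            if name.count("/") == 1 and name.endswith("/VERSION")
        ]
        if not names:
            raise ValueError(f"no VERSION file found in {path}")
        text = archive.read(names[0]).decode("utf-8")

    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip the header line, which has no "key: value" form
            info[key.strip()] = value.strip()
    return info


# Usage (placeholder file name):
# print(read_qza_version("classifier.qza").get("framework"))
```

If the reported framework release differs from the QIIME 2 version you have installed, that is a likely cause of the scikit-learn version error, and the classifier from the matching data resources page is the one to use.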

You should not need more RAM (unless you are building a full-length sequence classifier), but for a V3-specific classifier you will have to train it yourself. :biking_woman:


I can't figure out which drop-down menus you are referring to.

I read in several posts that "If you're doing 16s sequencing rarely, or this is your first time, a pre-trained, full length classifier is fine." However, I notice significant differences between using the full database and a 16S-only one.

For example in this case with kraken

The provided results, processed with the SILVA database and QIIME 2, are very close to what I get with kraken2 and SILVA 16S. But if I try to replicate the analysis with QIIME 2 and the complete SILVA database, I get totally different percentages.

What database should I believe?

I am referring to this drop-down menu in the docs:

Note that this is specifically in reference to training naïve Bayes classifiers, not necessarily Kraken 2 or other tools. In fact, I'd highly recommend reading the following papers about taxonomy assignments, and how to potentially improve them:

Short answer, who knows. :man_shrugging: It really depends on what questions you are trying to answer, and what makes sense given the environment your samples are from.

This all comes down to how the different versions of the databases are prepared. Without knowing how their database was curated (i.e. the sequence quality filters and taxonomic nomenclature chosen), it is difficult to say which is "better". You can easily see how the QIIME 2 SILVA databases were prepared by inspecting their provenance information, as they were made with RESCRIPt. We developed RESCRIPt so that you can choose your own criteria for making and curating your own database. Thus, if you generally know how the other database was made, you can try to replicate that processing with RESCRIPt / QIIME 2. :toolbox:
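
The provenance is easiest to browse in the Provenance tab at view.qiime2.org, but it can also be pulled out of an artifact directly. As a rough sketch, and assuming the usual archive layout in which every recorded step has an `action/action.yaml` file under `<uuid>/provenance/`, the following lists those files and returns their raw YAML text. The layout is my assumption, and the file name in the usage comment is a placeholder.

```python
import zipfile


def provenance_actions(path):
    """Map each provenance action.yaml inside a .qza/.qzv to its raw YAML text."""
    records = {}
    with zipfile.ZipFile(path) as archive:
        for name in sorted(archive.namelist()):
            parts = name.split("/")
            # Assumed layout: <uuid>/provenance/.../action/action.yaml
            is_action = (
                len(parts) >= 4
                and parts[1] == "provenance"
                and parts[-2:] == ["action", "action.yaml"]
            )
            if is_action:
                records[name] = archive.read(name).decode("utf-8")
    return records


# Usage (placeholder file name):
# for name, text in provenance_actions("silva-classifier.qza").items():
#     print(name)
#     print(text)
```

Each YAML file records the plugin action and parameters used at that step, which is the information you would need to rebuild a comparable database with RESCRIPt.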
