Training naive Bayes classifier on SILVA 138.2

Hi everyone,
I’m currently training a naive Bayes classifier on the SILVA 138.2 SSU NR99 database, and I’ve noticed that it takes significantly longer compared to previous versions. For instance:

Training with SILVA 138.1 finished in about 2 hours.
Training with SILVA 138.2 has already taken over 10 hours, and the job still hasn’t completed, so I have had to extend the allocated runtime further.
Here’s my job script:
#!/bin/bash --login
########## SBATCH Lines for Resource Request ##########

#SBATCH --time=10:00:00 # limit of wall clock time - how long the job runs
#SBATCH --nodes=1 # number of different nodes
#SBATCH --ntasks=1 # number of tasks
#SBATCH --cpus-per-task=6 # number of CPUs (or cores) per task
#SBATCH --mem-per-cpu=16G # memory required per allocated CPU (or core)
#SBATCH --mail-user= # email for notifications
#SBATCH --mail-type=ALL # type of emails: BEGIN, END, FAIL
#SBATCH --job-name qi2-silva # name of the job

########## Command Lines for Job Running ##########

# Load the required environment

module purge
conda activate qiime2

# Import your reference sequence and taxonomy files into QIIME 2 artifacts

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path SILVA_138.2_SSURef_NR99_tax_silva_trunc_dna.fasta \
  --output-path silva-138.2-sequences.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-path silva_taxonomy.tsv \
  --output-path silva-138.2-taxonomy.qza

# Train the classifier

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-138.2-sequences.qza \
  --i-reference-taxonomy silva-138.2-taxonomy.qza \
  --o-classifier silva-138.2-classifier.qza

The first step of importing the files was done successfully, but the second part of the job didn't complete due to time limitations.

Hi @asmaamorsi,

It should not be taking that long. I just constructed my own SILVA 138.2 classifier a few days ago w/o issue. There really is no difference between 138.1 and 138.2 other than updates to the taxonomy. So, there should be no differences in the time it takes to make the classifier.

You can simply run qiime rescript get-silva-data ... to fetch and import the taxonomy for you. See the tutorial which starts with this step here.

I would strongly suggest that you set:

--cpus-per-task=1   # or 2
--mem-per-cpu=32G   # or 48G
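As a rough sketch, the resource header of the job script with these values might look like the following. The other lines are carried over from the original script, and the final `echo` is only an illustrative sanity check, not something the training step needs.

```shell
#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1   # or 2
#SBATCH --mem-per-cpu=32G   # or 48G

# Slurm exports these variables inside a job; the fallbacks below only
# apply when the script is run outside Slurm.
cpus="${SLURM_CPUS_PER_TASK:-1}"
mem_per_cpu_mb="${SLURM_MEM_PER_CPU:-unknown}"
echo "CPUs per task: ${cpus}; memory per CPU (MB): ${mem_per_cpu_mb}"
```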

Let us know if these recommendations work. :slight_smile:

Hi,

Thanks for the quick response. I tried using RESCRIPt several times, but I kept getting this error message:
Plugin error from rescript:
Parameter 'version' received '138.2' as an argument, which is incompatible with parameter type: Str % Choices('128', '132')¹ | Str % Choices('138')² | Str % Choices('138.1')³
I reran the job with the time limit extended to 15 hours, and it completed successfully in 12 hours. I'm not sure why it took that long, and I wasn't able to use RESCRIPt. For reference, I am using QIIME 2 version 2024.5.0.


You'll need to install the latest version of QIIME 2 (2024.10) to make use of get-silva-data for SILVA v138.2.
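For anyone on 2024.10 or newer, the RESCRIPt route could be sketched roughly as below. The helper names (`version_at_least`, `fetch_silva`) and the output file names are illustrative assumptions, and the parsing of `qiime --version` assumes its first line ends with the version number. Versions are compared numerically because a plain string comparison would rank 2024.5 above 2024.10. The reverse-transcribe step is there because the fetched sequences are RNA.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 (e.g. 2024.5.0) is at least
# version $2 (e.g. 2024.10), comparing year and month numerically.
version_at_least() {
  have_year=${1%%.*}; have_rest=${1#*.}; have_month=${have_rest%%.*}
  need_year=${2%%.*}; need_rest=${2#*.}; need_month=${need_rest%%.*}
  if [ "$have_year" -gt "$need_year" ]; then return 0; fi
  [ "$have_year" -eq "$need_year" ] && [ "$have_month" -ge "$need_month" ]
}

# Hypothetical wrapper: check the installed release, then fetch and import
# SILVA 138.2 and convert the RNA sequences to DNA.
fetch_silva() {
  # Assumption: the first line of `qiime --version` ends with the version.
  installed=$(qiime --version | head -n 1 | awk '{print $NF}')
  if ! version_at_least "$installed" 2024.10; then
    echo "QIIME 2 $installed is older than 2024.10; upgrade first." >&2
    return 1
  fi
  qiime rescript get-silva-data \
    --p-version '138.2' \
    --p-target 'SSURef_NR99' \
    --o-silva-sequences silva-138.2-ssu-nr99-rna-seqs.qza \
    --o-silva-taxonomy silva-138.2-ssu-nr99-tax.qza || return 1
  qiime rescript reverse-transcribe \
    --i-rna-sequences silva-138.2-ssu-nr99-rna-seqs.qza \
    --o-dna-sequences silva-138.2-ssu-nr99-seqs.qza
}
```

Calling `fetch_silva` inside the activated environment runs both steps. `version_at_least 2024.5.0 2024.10` fails, which is consistent with the 'version' error reported above on 2024.5.0.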

Or simply follow the tutorial, using your current version of QIIME 2, to download all the files required. There are several; click on the "The gritty details" menu to reveal the detailed instructions. It should work if you simply replace '138.1' with '138.2'.

-Mike


Thanks for the information. I will consider using QIIME 2 (2024.10) in future analyses. I have another question related to this topic: how do I decide the minimum length when training the classifier on only the V4 region? The expected amplicon size is ~390 bp, so I set the maximum length to 400, but I am not sure how to choose the minimum length.

I would simply follow the instructions for making an amplicon-specific classifier. Basically, trim out the amplicon region of interest using your PCR primer sequences. Then you should not need to worry about length trimming.
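A minimal sketch of that approach, assuming the sequence and taxonomy artifacts from the job script above. The function name and output file names are hypothetical. The primers are passed in as arguments because the thread does not say which ones were used.

```shell
#!/bin/sh
# Hypothetical helper: extract the amplicon region with the PCR primers,
# then train a naive Bayes classifier on the extracted reads.
# Usage: train_amplicon_classifier FORWARD_PRIMER REVERSE_PRIMER
train_amplicon_classifier() {
  f_primer="$1"
  r_primer="$2"
  if [ -z "$f_primer" ] || [ -z "$r_primer" ]; then
    echo "usage: train_amplicon_classifier FORWARD_PRIMER REVERSE_PRIMER" >&2
    return 1
  fi

  # Keep only the region between the primers; no length limits are set here.
  qiime feature-classifier extract-reads \
    --i-sequences silva-138.2-sequences.qza \
    --p-f-primer "$f_primer" \
    --p-r-primer "$r_primer" \
    --o-reads silva-138.2-v4-reads.qza || return 1

  # Train on the extracted region using the full reference taxonomy.
  qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads silva-138.2-v4-reads.qza \
    --i-reference-taxonomy silva-138.2-taxonomy.qza \
    --o-classifier silva-138.2-v4-classifier.qza
}
```

It would be called as `train_amplicon_classifier "$FORWARD_PRIMER" "$REVERSE_PRIMER"`, with the two variables set to the primer sequences used in your own PCR.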
