rescript dereplicate timeout error

Dear QIIME2 team and @SoilRotifer,

I’m currently working on the analysis of soil bacterial communities using the 16S rRNA gene region, amplified with primers 341F and 806R. I have completed the DADA2 denoising step and subsequently clustered the ASVs into OTUs at 97% similarity using VSEARCH. My current OTU representative sequence file contains 72,077 OTUs.
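For reference, the clustering step described above corresponds to something like the following (a sketch only; the artifact file names are placeholders, not taken from this run):

```shell
# De novo cluster DADA2 ASVs into OTUs at 97% similarity with VSEARCH.
# Input/output names are illustrative.
qiime vsearch cluster-features-de-novo \
  --i-sequences dada2-rep-seqs.qza \
  --i-table dada2-table.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table otu-table-97.qza \
  --o-clustered-sequences 16S_otu_sequences_97.qza
```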

I am now attempting to assign taxonomy using the SILVA 138.2 NR99 database, which I have processed through RESCRIPt using the following dereplication command:

qiime rescript dereplicate \
    --i-sequences silva-138.2-ssu-nr99-seqs-filt.qza  \
    --i-taxa silva-138.2-ssu-nr99-tax.qza \
    --p-mode 'uniq' \
    --o-dereplicated-sequences silva-138.2-ssu-nr99-seqs-derep-uniq.qza \
    --o-dereplicated-taxa silva-138.2-ssu-nr99-tax-derep-uniq.qza

To assign taxonomy, I am using qiime feature-classifier classify-consensus-vsearch, but the job has timed out repeatedly on our HPC system despite long wall-time limits (3, 5, and 6 days attempted). Below is my SLURM job script:

#!/bin/bash
#SBATCH --job-name=Bac_qiime_ref
#SBATCH --output=qiime_ref_%j.log
#SBATCH --error=qiime_ref_%j.err
#SBATCH --time=240:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=150G
#SBATCH --partition=Standard
#SBATCH --mail-user=salma.sarker@anu.edu.au
#SBATCH --mail-type=ALL

# Activate Conda
eval "$(conda shell.bash hook)"
conda activate /mnt/data/dayhoff/home/u7410018/.conda/envs/qiime2-amplicon-2024.2

# QIIME2 VSEARCH taxonomy assignment
qiime feature-classifier classify-consensus-vsearch \
  --i-query /mnt/data/dayhoff/home/u7410018/1-360/soil_ITS/raw_Soil_data/Bacteria/16S_otu_sequences_97.qza \
  --i-reference-reads /mnt/data/dayhoff/home/u7410018/1-360/soil_ITS/raw_Soil_data/Bacteria/silva-138.2-ssu-nr99-seqs-derep-uniq.qza \
  --i-reference-taxonomy /mnt/data/dayhoff/home/u7410018/1-360/soil_ITS/raw_Soil_data/Bacteria/silva-138.2-ssu-nr99-tax-derep-uniq.qza \
  --p-perc-identity 0.90 \
  --p-maxaccepts 1 \
  --p-threads 4 \
  --o-classification /mnt/data/dayhoff/home/u7410018/1-360/soil_ITS/raw_Soil_data/Bacteria/16S_qiime2taxonomy.qza \
  --o-search-results /mnt/data/dayhoff/home/u7410018/1-360/soil_ITS/raw_Soil_data/Bacteria/16S_search_qiime2results.qza

My questions:

  1. Could the failure be due to the large number of OTUs (72,077), or the size of the SILVA database?
  2. Would reducing --p-perc-identity or modifying --p-maxaccepts help improve runtime?
  3. Is there a more efficient way to speed up this process (e.g., increasing threads, splitting queries, or using a pre-trained classifier)?
  4. Would switching to classify-sklearn or training a Naive Bayes classifier help in my case?
  5. Any advice for optimizing memory or CPU usage for this kind of task on HPC?

I appreciate any insights or suggestions to resolve this issue.

Best regards,
Salma Sarker
PhD Candidate, ANU
salma.sarker@anu.edu.au

Hi @Salma_Sarker ,

Both. The query and the reference are each very large. You probably do not want to reduce the size of your reference database, but you could filter your OTUs to reduce complexity. 72k is a lot! Many are probably noisy and/or very low abundance, so they could be worth dropping, depending on your experimental goals.

Lowering maxaccepts would probably reduce runtime a little, but you have already set it to 1, which is the minimum. perc-identity would probably not impact runtime.

Yes, increasing threads would reduce runtime, if you have the resources for this. A pre-trained classifier would probably be faster as well, but your very large query is the main reason for the long runtime.

Good luck!


Here's another setting to consider!

--maxrejects
Int % Range(1, None) | Str % Choices('all')

Maximum number of non-matching target sequences to consider before stopping the search. This option works in pair with maxaccepts (see maxaccepts description for details).[default: 'all']


This setting works just like maxaccepts, but for reads that don't match the database.

Because this defaults to all, if nothing in the database is >90% similar to a query, the entire database will be searched for that query. This is extremely slow!

It looks like this data may have triggered an edge case, which will make the uncommon features take a very long time to search. :snail:

I recommend using a medium or large value for maxrejects, like 100 or 1000.
Standalone vsearch uses a default of 32, so even these values are several-fold larger than that, while stopping far short of scanning the full database!

This will eliminate this edge case, but there are still many features to search! Let me know what kind of speedup you get from adding
--p-maxrejects 100
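
For reference, the full command from the script above with only that flag added might look like this (paths shortened to bare file names for readability):

```shell
qiime feature-classifier classify-consensus-vsearch \
  --i-query 16S_otu_sequences_97.qza \
  --i-reference-reads silva-138.2-ssu-nr99-seqs-derep-uniq.qza \
  --i-reference-taxonomy silva-138.2-ssu-nr99-tax-derep-uniq.qza \
  --p-perc-identity 0.90 \
  --p-maxaccepts 1 \
  --p-maxrejects 100 \
  --p-threads 4 \
  --o-classification 16S_qiime2taxonomy.qza \
  --o-search-results 16S_search_qiime2results.qza
```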


Dear @colinbrislawn and @Nicholas_Bokulich

Thank you so much for your kind support and helpful suggestions regarding my taxonomy assignment issue using classify-consensus-vsearch.

Following your advice, I performed the following steps:

  1. Filtered the OTU table to remove low-abundance features (fewer than 10 total reads) and features found in fewer than 2 samples, reducing the number of OTUs from 72,077 to 8,685.
  2. Included --p-maxrejects 100 in the VSEARCH classification step to improve performance for poor or no matches.
  3. Increased the number of threads to 6, matching SLURM CPU allocation.
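
Roughly, steps 1 and 2 above as commands (a sketch; the artifact file names are placeholders, while the thresholds are the ones stated above):

```shell
# Step 1: drop OTUs with fewer than 10 total reads, or present in
# fewer than 2 samples.
qiime feature-table filter-features \
  --i-table otu-table-97.qza \
  --p-min-frequency 10 \
  --p-min-samples 2 \
  --o-filtered-table otu-table-97-filtered.qza

# Keep only the surviving representative sequences as the new,
# much smaller query for classify-consensus-vsearch (run with
# --p-maxrejects 100 per step 2).
qiime feature-table filter-seqs \
  --i-data 16S_otu_sequences_97.qza \
  --i-table otu-table-97-filtered.qza \
  --o-filtered-data 16S_otu_sequences_97-filtered.qza
```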

After implementing these optimizations, I re-ran the job on our HPC system (150GB memory), and I’m happy to report that the taxonomy assignment was successfully completed within a few minutes — a dramatic improvement over the previous timeouts after several days.

Upon reviewing the results:

  • 6,944 OTUs were assigned to Kingdom Bacteria
  • 1,650 OTUs remained unassigned
  • All assignments were at consensus level 1; however, none of the OTUs were resolved to species level

I would really appreciate any suggestions you might have for improving species-level resolution.

Thank you again for your clear, practical guidance — it truly made a difference!


Hello @Salma_Sarker

You may find this helpful:

The QIIME 2 devs have done a bunch of work on taxonomy classification! Here is one of their big papers:


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.