I’m currently working on the analysis of soil bacterial communities using the 16S rRNA gene region, amplified with primers 341F and 806R. I have completed the DADA2 denoising step and subsequently clustered the ASVs into OTUs at 97% similarity using VSEARCH. My current OTU representative sequence file contains 72,077 OTUs.
I am now attempting to assign taxonomy using the SILVA 138.2 NR99 database, which I have processed through RESCRIPt using the following dereplication command:
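The dereplication was along these lines (a sketch of a standard RESCRIPt call; the artifact names here are placeholders rather than my exact file names):

```shell
# Sketch of the RESCRIPt dereplication step; input/output names are placeholders.
qiime rescript dereplicate \
  --i-sequences silva-138.2-ssu-nr99-seqs.qza \
  --i-taxa silva-138.2-ssu-nr99-tax.qza \
  --p-mode uniq \
  --o-dereplicated-sequences silva-138.2-ssu-nr99-seqs-derep.qza \
  --o-dereplicated-taxa silva-138.2-ssu-nr99-tax-derep.qza
```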
To assign taxonomy, I am using qiime feature-classifier classify-consensus-vsearch, but the job has failed multiple times on our HPC system after long run times (I tried time limits of 3, 5, and 6 days). The job consistently fails with time-out errors. Below is my SLURM job script:
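Roughly, the script looks like this (a reconstruction for illustration; the file names, time limit, and resource requests are placeholders, not my exact values):

```shell
#!/bin/bash
#SBATCH --job-name=tax-vsearch
#SBATCH --cpus-per-task=6
#SBATCH --mem=150G
#SBATCH --time=6-00:00:00

# classify-consensus-vsearch against the dereplicated SILVA 138.2 NR99 reference.
# --p-threads is still at its default of 1 here; maxaccepts is set to 1.
# (--o-search-results is required in recent QIIME 2 releases.)
qiime feature-classifier classify-consensus-vsearch \
  --i-query otus-rep-seqs.qza \
  --i-reference-reads silva-138.2-ssu-nr99-seqs-derep.qza \
  --i-reference-taxonomy silva-138.2-ssu-nr99-tax-derep.qza \
  --p-maxaccepts 1 \
  --p-perc-identity 0.90 \
  --p-threads 1 \
  --o-classification taxonomy.qza \
  --o-search-results search-results.qza
```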
Both: your query and your reference database are both very large. You probably do not want to reduce the size of your reference database, but you could filter out OTUs to reduce complexity. 72k is a lot! Many are probably noisy and/or very low abundance, so they could be worth dropping, depending on your experimental goals.
Lowering maxaccepts would probably reduce runtime a little, but it looks like you have already set it to 1, so you cannot go any lower. perc-identity would probably not impact runtime.
Yes increasing threads would reduce runtime, if you have the resources for this. The pre-trained classifier would probably be faster as well, but you have a very large query so that is the main reason for the long runtime.
Maximum number of non-matching target sequences to consider before stopping the search. This option works in pair with maxaccepts (see maxaccepts description for details).[default: 'all']
This setting works just like maxaccepts, but for reads that don't match the database.
Because this uses all as the default, if there is nothing in the database >90% similar, then the entire database will be searched! This is extremely slow!
It looks like this data may have triggered an edge case, which will make the uncommon features take a very long time to search.
I recommend using a medium or large value for maxrejects, like 100 or 1000. For comparison, standalone VSEARCH uses a default of 32, so all is a few orders of magnitude larger!
This will eliminate this edge case, but there are still many features to search! Let me know what kind of speedup you get from adding --p-maxrejects 100
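Concretely, this is just one extra flag on your existing call, something like the following (artifact names are placeholders; keep your other options as they are):

```shell
# Same classification call as before, with a bounded maxrejects so searches
# for poorly-matching queries stop early instead of scanning the whole database.
qiime feature-classifier classify-consensus-vsearch \
  --i-query otus-rep-seqs.qza \
  --i-reference-reads silva-ref-seqs.qza \
  --i-reference-taxonomy silva-ref-tax.qza \
  --p-maxaccepts 1 \
  --p-maxrejects 100 \
  --p-threads 6 \
  --o-classification taxonomy.qza \
  --o-search-results search-results.qza
```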
Thank you so much for your kind support and helpful suggestions regarding my taxonomy assignment issue using classify-consensus-vsearch.
Following your advice, I performed the following steps:
1. Filtered the OTU table to remove low-abundance features (fewer than 10 total reads) and features found in fewer than 2 samples, reducing the number of OTUs from 72,077 to 8,685.
2. Included --p-maxrejects 100 in the VSEARCH classification step to improve performance for queries with poor or no matches.
3. Increased the number of threads to 6, matching the SLURM CPU allocation.
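For the filtering step I used the standard feature-table actions, roughly as follows (artifact names are placeholders):

```shell
# 1) Drop features with fewer than 10 total reads or present in fewer than 2 samples.
qiime feature-table filter-features \
  --i-table otu-table.qza \
  --p-min-frequency 10 \
  --p-min-samples 2 \
  --o-filtered-table otu-table-filtered.qza

# 2) Keep only the representative sequences still present in the filtered table.
qiime feature-table filter-seqs \
  --i-data otus-rep-seqs.qza \
  --i-table otu-table-filtered.qza \
  --o-filtered-data otus-rep-seqs-filtered.qza
```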
After implementing these optimizations, I re-ran the job on our HPC system (150GB memory), and I’m happy to report that the taxonomy assignment was successfully completed within a few minutes — a dramatic improvement over the previous timeouts after several days.
Upon reviewing the results:
- 6,944 OTUs were assigned to Kingdom Bacteria.
- 1,650 OTUs remained unassigned.
- All assignments were at consensus level 1.
- However, none of the OTUs were resolved down to species level.
I would really appreciate any suggestions you might have for improving species-level resolution.
Thank you again for your clear, practical guidance — it truly made a difference!