I switched from blast to vsearch for the classifier and thought it would be much quicker thanks to the multithreading but was very disappointed to see that its is in fact much slower (from the ongoing run which has processed only 4k reads in >12h on 84 cpu). The cpus are 100% loaded but little output comes out.
I suspect that the database not being indexed, a lag time is lost for each new search over the quite large rrnDB database of 108986206 nt in 22351 seqs (rrnDB)
I found a thread discussing the recent addition of indexing for vsearch here.
Could it be that this is not yet implemented in the qiime2 wrapper (maybe because a number of bugs were reported in the thread) and could it please be considered if it answers my needs and is now stable and working well.
this does not look like the thread example commands below where the database has a .udb suffix instead of my dna-sequences.fasta (real fasta, I checked)
## example command with index from the thread linked above:
# make index
vsearch --makeudb_usearch CGDBv2.0.fa --id 0.83 --minwordmatches 1 --wordlength 8 --output CGDB2.0.udb -dbmask none -qmask none --strand both --maxaccepts 10000000 --query_cov 0.95 --maxgaps 1 -maxrejects 4096 --maxseqlength 1000000
#runsearch using index
vsearch --usearch_global test_primers.fa -db CGDBv2.0.udb --id 0.83 --minwordmatches 1 --wordlength 8 --dbmask none --qmask none --strand both --userout against_udb.vsearch --userfields 'query+qstrand+qlo+qhi+qrow+target+tstrand+tilo+tihi+trow+mism+opens' --maxaccepts 10000000 --query_cov 0.95 --maxgaps 1 --maxrejects 4096 --threads 15 --maxseqlength 1000000
While using udb indexes can be faster, I think the major slowdown is another part of this command. --maxrejects 0
This sets how many target reads get aligned before vsearch gives up and reports no hits. Setting this to 0 means that the full database will be searched if a hit is not found.
This can slow things down a lot. The default is 32, but raising it to 100 or so is really reasonable, but also much faster. Searching the full database is just silly.
Maybe this should be exposed as a new parameter in the plugin!
Hacktoberfest is here, and there is an open issue!
This would be a good first issue, if you are interested!
Hi Colin,
It seems I keep stepping into shortcomings here
Your comments make perfect sense and I would love to contribute but my python is really too bad to risk this.
I hope someone else will be skilled enough and could also add the indexing as option for even more speed!
I was working with only 10% of my data and it was taking >30h with blast already and freezing miserably with vsearch for the reason you pointed to.
To be fair, you are using an extraordinarily large reference database, and also presumably long sequences:
Have you considered dereplicating or clustering your reference sequences into OTUs to reduce redundancy?
My previous recommendation that vsearch is faster than blast (due to multithreading) may be wrong in this case, since vsearch may not scale as well to longer sequences (since it is performing global rather than local alignment), so the advantage of parallelization is lost.
We will look into this, thanks for notifying us of this new feature in vsearch. Adding a new vsearch index type to QIIME 2 would not be a trivial task, however, so something we probably cannot tackle in time for this month's release. However, the other features that @colinbrislawn linked to might be achievable, especially if someone wants to grab that issue in the spirit of Hacktoberfest. I promise it would be an easy one, even for someone without much python experience!
By contrast when I tried adding code to specify maxrejects with default to 100 in the different parts of the _vsearch code with inspiration from the maxaccept pieces of code, running led to error messages. Not so easy after all
Just FYI, I made these changes to the development source code yesterday, so with luck these new features will be added in the next release of QIIME 2, coming out at the end of this month.
maxrejects and maxhits and a couple other features are now exposed… but not the indexed database, which will have to wait for another day. That will take a bit more work and I’d like to see some benchmarks before taking the time to implement in q2-feature-classifier.
Thanks a lot Nicholas,
I will benchmark the sole vsearch command tonight with and without index (I will fix the maxreject to 4096 as in the indexing example I found) and will share here the two times. I will derive the indexed-db command from the one run by qiime, only changing the input database. this should hopefully run and give us a comparison on my big data
Here are the results of a veeeeery long job repeated both on fasta and udb and showing a zero improvement after indexing. This does not exclude some benefit in other setups of course.
So I will next time run against fasta but lower the --maxrejects to 100 or so to run this is hours rather than days.
Yeah, I have found that too. Given that vsearch indexing is often IO bound, the speedup from using an index is only really worth it if you have crazy IO and ram, and very few CPU threads. In my own testing, the index could be made right from the fna file faster than the UBD file could be loaded.
Have you dereplicated your dna-sequences.fasta file?