using database indexing for vsearch

splaisan · October 21, 2019, 7:39am

I switched from blast to vsearch for the classifier and thought it would be much quicker thanks to the multithreading but was very disappointed to see that its is in fact much slower (from the ongoing run which has processed only 4k reads in >12h on 84 cpu). The cpus are 100% loaded but little output comes out.

I suspect that the database not being indexed, a lag time is lost for each new search over the quite large rrnDB database of 108986206 nt in 22351 seqs (rrnDB)

I found a thread discussing the recent addition of indexing for vsearch here.

Could it be that this is not yet implemented in the qiime2 wrapper (maybe because a number of bugs were reported in the thread) and could it please be considered if it answers my needs and is now stable and working well.

the command I run in qiime2 is:

qiime feature-classifier classify-consensus-vsearch --i-query rep-seqs.qza --i-reference-reads /data/biodata/MetONTIIME_DB/rrnDB_operons_sequence.qza --i-reference-taxonomy /data/biodata/MetONTIIME_DB/rrnDB_operons_taxonomy.qza --p-perc-identity 0.77 --p-query-cov 0.8 --p-maxaccepts 1 --p-strand both --p-min-consensus 0.51 --p-unassignable-label Unassigned --p-threads 84 --o-classification taxonomy.qza

which wraps vsearch as:

vsearch --usearch_global /tmp/qiime2-archive-wn0t7u2n/2f8ef59b-81ea-4a59-badf-4680f5df5fa9/data/dna-sequences.fasta --id 0.77 --query_cov 0.8 --strand both --maxaccepts 1 --maxrejects 0 --output_no_hits --db /tmp/qiime2-archive-e9amzhku/c9e2d584-3adb-4f69-b887-30d7d888a9f8/data/dna-sequences.fasta --threads 84 --blast6out /tmp/tmpvg9lfrb0

this does not look like the thread example commands below where the database has a .udb suffix instead of my dna-sequences.fasta (real fasta, I checked)

## example command with index from the thread linked above:

# make index
vsearch --makeudb_usearch CGDBv2.0.fa --id 0.83 --minwordmatches 1  --wordlength 8 --output CGDB2.0.udb -dbmask none -qmask none --strand both --maxaccepts 10000000 --query_cov 0.95 --maxgaps 1 -maxrejects 4096 --maxseqlength 1000000

#runsearch using index
vsearch --usearch_global test_primers.fa -db CGDBv2.0.udb --id 0.83 --minwordmatches 1 --wordlength 8 --dbmask none --qmask none --strand both --userout against_udb.vsearch --userfields 'query+qstrand+qlo+qhi+qrow+target+tstrand+tilo+tihi+trow+mism+opens' --maxaccepts 10000000 --query_cov 0.95 --maxgaps 1 --maxrejects 4096 --threads 15 --maxseqlength 1000000

Thanks in advance for your remarks and/or help

colinbrislawn · October 21, 2019, 1:48pm

Hello again Stephane,

While using udb indexes can be faster, I think the major slowdown is another part of this command.
--maxrejects 0
This sets how many target reads get aligned before vsearch gives up and reports no hits. Setting this to 0 means that the full database will be searched if a hit is not found.

This can slow things down a lot. The default is 32, but raising it to 100 or so is really reasonable, but also much faster. Searching the full database is just silly.

Maybe this should be exposed as a new parameter in the plugin!

Hacktoberfest is here, and there is an open issue!
https://github.com/qiime2/q2-feature-classifier/issues/151

This would be a good first issue, if you are interested!

Colin

splaisan · October 21, 2019, 3:04pm

Hi Colin,
It seems I keep stepping into shortcomings here
Your comments make perfect sense and I would love to contribute but my python is really too bad to risk this.

I hope someone else will be skilled enough and could also add the indexing as option for even more speed!
I was working with only 10% of my data and it was taking >30h with blast already and freezing miserably with vsearch for the reason you pointed to.

We'll see...

Best

Nicholas_Bokulich · October 21, 2019, 4:23pm

To be fair, you are using an extraordinarily large reference database, and also presumably long sequences:

Have you considered dereplicating or clustering your reference sequences into OTUs to reduce redundancy?

My previous recommendation that vsearch is faster than blast (due to multithreading) may be wrong in this case, since vsearch may not scale as well to longer sequences (since it is performing global rather than local alignment), so the advantage of parallelization is lost.

We will look into this, thanks for notifying us of this new feature in vsearch. Adding a new vsearch index type to QIIME 2 would not be a trivial task, however, so something we probably cannot tackle in time for this month's release. However, the other features that @colinbrislawn linked to might be achievable, especially if someone wants to grab that issue in the spirit of Hacktoberfest. I promise it would be an easy one, even for someone without much python experience!

splaisan · October 22, 2019, 1:08pm

I changed

cmd = ['vsearch', '--usearch_global', seqs_fp, '--id', str(perc_identity),
      '--query_cov', str(query_cov), '--strand', strand, '--maxaccepts',
      str(maxaccepts), '--maxrejects', '0', '--output_no_hits', '--db',
      ref_fp, '--threads', str(threads)]

'0' => '100' and it now runs nicely.

By contrast when I tried adding code to specify maxrejects with default to 100 in the different parts of the _vsearch code with inspiration from the maxaccept pieces of code, running led to error messages. Not so easy after all

Nicholas_Bokulich · October 22, 2019, 2:15pm

Just FYI, I made these changes to the development source code yesterday, so with luck these new features will be added in the next release of QIIME 2, coming out at the end of this month.

maxrejects and maxhits and a couple other features are now exposed... but not the indexed database, which will have to wait for another day. That will take a bit more work and I'd like to see some benchmarks before taking the time to implement in q2-feature-classifier.

splaisan · October 22, 2019, 2:37pm

Thanks a lot Nicholas,
I will benchmark the sole vsearch command tonight with and without index (I will fix the maxreject to 4096 as in the indexing example I found) and will share here the two times. I will derive the indexed-db command from the one run by qiime, only changing the input database. this should hopefully run and give us a comparison on my big data

splaisan · October 31, 2019, 9:13am

Here are the results of a veeeeery long job repeated both on fasta and udb and showing a zero improvement after indexing. This does not exclude some benefit in other setups of course.
So I will next time run against fasta but lower the --maxrejects to 100 or so to run this is hours rather than days.

Please read Redirecting to Google Groups for more discussion about limits

# create index
+ vsearch --makeudb_usearch /data2/analyses/MetONTIIME_4smpl_rrnDB/indexing_test/reference-sequences.fasta --id 0.77 --minwordmatches 1 --wordlength 8 --output /data2/analyses/MetONTIIME_4smpl_rrnDB/indexing_test/reference-sequences.udb --dbmask none --qmask none --strand both --maxaccepts 10000000 --query_cov 0.8 --maxgaps 1 --maxrejects 4096 --maxseqlength 1000000

vsearch v2.8.1_linux_x86_64, 503.8GB RAM, 88 cores
https://github.com/torognes/vsearch
Reading file /data2/analyses/MetONTIIME_4smpl_rrnDB/indexing_test/reference-sequences.fasta 100%
108986206 nt in 22351 seqs, min 3831, max 5664, avg 4876
Counting k-mers 100%
Creating k-mer index 100%
Writing UDB file 100%

real 0m6.721s
user 0m5.992s
sys 0m0.684s

# search against fasta
+ vsearch --usearch_global /data2/analyses/MetONTIIME_4smpl_rrnDB/indexing_test/dna-sequences.fasta --id 0.77 --query_cov 0.8 --strand both --maxaccepts 1 --maxrejects 4096 --output_no_hits --db /data2/analyses/MetONTIIME_4smpl_rrnDB/indexing_test/reference-sequences.fasta --threads 84 --blast6out /data2/analyses/MetONTIIME_4smpl_rrnDB/indexing_test/fasta_out

vsearch v2.8.1_linux_x86_64, 503.8GB RAM, 88 cores
https://github.com/torognes/vsearch
Reading file /data2/analyses/MetONTIIME_4smpl_rrnDB/indexing_test/reference-sequences.fasta 100%
108986206 nt in 22351 seqs, min 3831, max 5664, avg 4876
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching query sequences: 27174 of 57801 (47.01%)

real 6200m36.366s
user 464528m4.904s
sys 855m14.928s

# search against udb
+ vsearch --usearch_global /data2/analyses/MetONTIIME_4smpl_rrnDB/indexing_test/dna-sequences.fasta --id 0.77 --query_cov 0.8 --strand both --maxaccepts 1 --maxrejects 4096 --output_no_hits --db /data2/analyses/MetONTIIME_4smpl_rrnDB/indexing_test/reference-sequences.udb --threads 84 --blast6out /data2/analyses/MetONTIIME_4smpl_rrnDB/indexing_test/udb_out

vsearch v2.8.1_linux_x86_64, 503.8GB RAM, 88 cores
https://github.com/torognes/vsearch
Reading UDB file /data2/analyses/MetONTIIME_4smpl_rrnDB/indexing_test/reference-sequences.udb 100%
Reorganizing data in memory 100%
Creating bitmaps 100%
Parsing abundances 100%
108986206 nt in 22351 seqs, min 3831, max 5664, avg 4876
Searching 100%
Matching query sequences: 27174 of 57801 (47.01%)

real 6128m59.346s
user 466170m36.924s
sys 978m51.560s

colinbrislawn · October 31, 2019, 3:58pm

Yeah, I have found that too. Given that vsearch indexing is often IO bound, the speedup from using an index is only really worth it if you have crazy IO and ram, and very few CPU threads. In my own testing, the index could be made right from the fna file faster than the UBD file could be loaded.

Have you dereplicated your dna-sequences.fasta file?

Colin

splaisan · October 31, 2019, 4:16pm

Yes, I ran MetONTIIME which includes vsearch dereplication before classification

echo -e "\nqiime vsearch dereplicate-sequences"
time qiime vsearch dereplicate-sequences \
  --i-sequences sequences.qza \
  --o-dereplicated-table table_tmp.qza \
  --o-dereplicated-sequences rep-seqs_tmp.qza

echo -e "\nqiime vsearch cluster-features-de-novo"
time qiime vsearch cluster-features-de-novo \
  --i-sequences rep-seqs_tmp.qza \
  --i-table table_tmp.qza \
  --p-perc-identity 1 \
  --o-clustered-table table.qza \
  --o-clustered-sequences rep-seqs.qza \
  --p-threads $THREADS

splaisan · November 4, 2019, 10:03am

Dear All,

(Torbjørn) The author of vsearch answered about the optimal settings for my application.

When someone edits the vsearch classifier module, this info might be a good guidance (I let you decide about this obviously)

--maxaccepts 100 --maxrejects 100 --maxhits 1

Please read details here :
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/vsearch-forum/xwlzDbL91Cc/q3I01DTAAAAJ