Vsearch classifier memory

Hi all,
I was delving into understanding how changing two parameters in a vsearch-classified method would affect the resulting taxonomic information. Here was the argument:

qiime feature-classifier classify-consensus-vsearch \
--i-query "$FASTAPATH"/dada2.arthseqs.qza \
--i-reference-reads "$REFPATH"/boldCOI.derep.seqs.qza \
--i-reference-taxonomy "$REFPATH"/boldCOI.derep.tax.qza \
--p-perc-identity 0.90 --p-query-cov 0.94 --p-strand both \
--p-threads 24 --p-maxaccepts 0 \
--o-classification derep.vsearch_p90search.qza

This led to an error which suggests I ran out of memory (I had ~ 128Gb or RAM dedicated to the above argument):

slurmstepd: error: Detected 4 oom-kill event(s) in step 66653.batch cgroup

One of the things I was thinking would resolve this was to split the --i-query into a bunch of chunks. I wanted to double check with the community (@colinbrislawn - thoughts?) that each representative sequence is independently assessed when being classified, so splitting the input fasta wouldn’t affect the result, right?

Thanks for the consideration!

1 Like

Correct. Splitting the query seqs into batches is most likely what vsearch is doing under the hood anyway for parallel processing.

However, it may or may not help your memory issue. The problem may be the size of the reference dataset and/or if multiple copies are stored in memory during parallel processing. Maybe try using fewer threads.

Good luck!

Perfect - thanks @Nicholas_Bokulich,
Follow up question - there are plenty of merging options, including the ability to combine sequence artifact files. I wonder if it trivial or complicated to request some functionality that does the opposite.

The seqkit program, for example, lets you split a fasta in a number of ways including what I think are the most intuitive: splitting a fasta into either N chunks, or M sequence records. Having a splitting function in QIIME2 might be handy for instances like this one, where you want to take your sequence artifact file and break it up?

Thanks!

Thanks for the suggestion @devonorourke! I have raised a feature request for this.

In truth I am not sure how much enthusiasm there would be for this, since I am not sure that it would be very widely used — but I am all for adding a basic utility function like this, especially if someone out there in the world is interested in contributing a pull request to add it to QIIME 2 :wink: .

Gentlemen!

Just wanted to comment to mention that vsearch does use more memory per thread… but not by much. Using 10x more threads only increases memory usage by about 20%.

At the risk of stating the obvious, the main usage of memory is the size of the database and the length of k-mers you use to index it. ~128 GB of ram is huge so… how big is your database? Have you considering pre-clustering it at 99% of something in order to reduce its size?

Colin

2 Likes

Forgot to mention:

Yep, you could do that, but it won’t solve your memory woes. Memory is all about the database, not the query and you can’t split up the database without changeing the result.

Pre-clusting the database to reduce its size is still going to be my recommendation. Or rent a computer with 384 GB of ram for literally $2 / hour. :dollar: :dollar:

Colin

Thanks @colinbrislawn,
Very helpful to understand where the memory issues might be arising. Also very disappointing on my end because I thought I had a good solution... and apparently I don't :persevere: !!

Question for you: is there any back of the envelope calculation that would help me understand the amount of memory needed given the size of the database's fasta file? Understandably this probably depends on the kmer uniqueness of the database, but even a ballpark would be helpful. I have access to servers with up to 1Tb of RAM, so unless it looks like I'd exceed that capacity, I'll avoid paying Amazon their $2/hour in this instance!

(pun intended).

The COI arthropod db I set up is dereplicated down to about 1.8 million records. Further clustering at 99% would reduce this dramatically, down to about 400,000 records. My concern/hesitation in doing so was losing taxonomic information in the process; as you might expect, the number of distinct Genus and Species records is reduced with even a minor amount of clustering:

1 Like

Ah! I can see your dilemma. Yeah, that makes sense to me too.

If you can do a trial run on the node with 1T of memory, you could run vsearch directly and use the --log flag to figure out the max memory used during the process. That would at least give you a target for what you need.

This is a longshot, but if your reads are an exact match to the database (100% and same length), you could use vsearch --search_exact. This uses much less ram and is crazy fast, but only works for literally identical reads. If you have 100% matches at the species level, this option may be worth trying.

Keep up the good work!

Colin

1 Like

Wait a sec… I think I’m overlooking something.

@colinbrislawn - I’ve already loaded this database in this same compute cluster (many times) previously without any issue. The thing I changed from those prior instances to the time it failed was switching up two parameters:

  1. the --p-maxaccepts had been left in prior runs to the default (10, I think). What I switched up when the memory issue creeped up was I forced the program to search the entire database by setting --p-maxaccepts 0.

  2. The --p-perc-identity was switched from 97% to 90% to try to recover more distant hits.

I know this will absolutely increase the number of hits I record, but I’m not connecting how increasing the number of hits relates to the memory crunch (unless my list of records is growing so much more that the combination of database and list is exceeding 128 Gb of RAM?)

What I wanted to do was compare how many more ambiguous taxa would occur when I made my search parameters more liberal.

There are probably hundreds if not thousands of hits for most records once I set that --p-maxaccepts to 0. Does anyone know what happens internally in vsearch when the user specifies the --maxhits parameter to 1?
To follow a (admittedly weak) wrestling analogy:

  • Option1: does it go Royal Rumble style, where all possible sequences battle each other for the best rank… in this instance, the program would collect all possible hits, then compare them all to choose a winner.
  • Option2: is there a champion belt, where there are a series of paired fights? the first “hit” is declared the de facto champ, and each successive “hit” is compared to the champ for acting as title holder as best sequence match?

I don’t see how you’d be able to invoke the LCA algorithm with Option2, so perhaps it’s a moot point. I am curious what vsearch does with that --maxhits 1 option though.

Maybe what I need to do is just avoid using the exhaustive search, and instead cap the number of hits at something like 100 or 500.

Thanks!

Thanks @colinbrislawn! This can be an issue with parallelization of other methods but I have not benchmarked vsearch runtimes so wanted to be sure.

Yep, sounds like you are retrieving an extremely large number of hits, particularly by using maxaccepts=0 to retrieve all hits. This hits file is later being loaded into memory (not read line by line) to perform consensus taxonomy assignment. It must be a humongous file!

maxaccepts=100 would be liberal. Taking all hits that have ≥ 90% similarity is extreme. Go with maxaccepts=100 and I think you will accomplish what you are after, without hitting memory constraints.

Maybe @colinbrislawn has used this parameter, I have not. It sounds like this just impacts how many results are displayed, whereas maxaccepts impacts how many hits are actually considered. So more similar to your “option 1” — the hits are ranked by similarity to the query and the top hit would be taken. We could consider exposing this parameter in classify-consensus-vsearch if it is useful.

Update for @colinbrislawn and @Nicholas_Bokulich,
I reran with --maxaccepts 200 and the job finished without any issue. The memory problem was not to do with the database size, apparently, but the number of hits.

This could be for any number or reasons:

  1. Maybe it’s because my arthropod COI query sequences are shorter (181 bp) than a lot of 16S amplicons? Or perhaps the region is generally more conserved than other marker genes of comparable length.
  2. Perhaps the COI gene writ large is more conserved than 16S or 18S or ITS? No idea.

But I think the most likely candidate is number 3.
3. Most of the COI queries come from a handful of arthropod Orders: things that fly with two wings (dipterans), beetles (coleopterans), and moths (lepidopterans). That is, my bats eat three kinds of things…
Turns out the 1.8 million arthropod records in my reference database also are dominated by those same kinds of things.
By setting a % identity at around 90%, there are a vast number of matches for every one of my 13,400 query sequences I’m searching the database against, making that list incredibly long.

One question about programming this: wouldn’t it be possible to perform the exhaustive search per-query, then apply the LCA process to declare a winner, then iterate to the next query? Then your list of candidates would only ever grow as large as the database itself (presumably each query couldn’t be a hit to the entire database, but that would be the theoretical upper limit, right?).

Is there some advantage of processing all queries first into a single lest, then applying LCA?

Anyway, just wanted to summarize:
The memory issue raised here was about the number of “hits” being recorded - it was inflated because the percent identity was too liberal when coupled with an exhaustive --maxaccept 0 value.

Oh, and @Nicholas_Bokulich - the idea behind a --maxhits parameter is probably not useful when you want to apply LCA, because vsearch will pick just one winner (so you’d never be able to find instances where multiple sequences above your threshold warrant LCA). Feel free to correct me @colinbrislawn!

Cheers

Sounds about right. Honestly, this is not likely to be an issue unique to COI… if we did the same thing with a 16S database we may well have the same memory issue.

That would be ideal, but the issue is that this process occurs in two distinct steps. Vsearch does not perform the LCA, q2-feature-classifier does. So vsearch does its thing, produces a list of hits for each query, then q2-feature-classifier looks at that list to perform LCA. That could happen more efficiently, e.g., instead of reading the whole list into memory it could just grab N lines at a time where N == maxaccepts. The only issue with that is that there could be fewer than N records for each query (e.g., if no hits were found). So this is a more challenging problem than it seems at face value.

I whole heartedly endorse that these things are over my head, but I would offer the insight that when I run vsearch with this command:

vsearch --usearch_global $FASTA --db $REF --threads 24 \
--id 0.90 --query_cov 0.1 --strand both --maxaccepts 0 --maxrejects 0 --maxhits 1 \
--output_no_hits --blast6out vsearch.p90c10out.txt

… I can see that the records are being added iteratively to the vsearch.p90c10out.txt file. After one minute there might be five records, and a few minutes later there is the 6th, 7th, etc… So something about that function allows for records to be produced iteratively.

Maybe there would be a way to submit one query at a time from the seqs.qza file in the QIIME version, pass that along to vsearch running a similar command as above - except no --maxhits flag! - and then use that entire list of matches to the one query sequence to run LCA.

If you’re loading the first batch of matches from the vsearch.p90c10out.txt file, you can flag for instances in which there are no matches because there is a specific symbol that vsearch uses to indicate this in their output - it’s a * in the second column:

002893244547630fd796300fe89df2a7        3945214 100.0   20      0       0       1       181     1       146     -1      0
0016e4bbcf484b6b1cd8c75ce052cf3f        *       0.0     0       0       0       0       0       0       0       -1      0

I think because there is single character indicating what to do with NULL results, you should be able to avoid that problem of no hits happening?

Anyhow, other fish to fry today, and the problem of this thread was resolved.
Thanks for the input!

1 Like

Good morning,

Lot’s of great progress being made in this thread! I agree with the conclusion that the excessive number of hits in the vsearch results is causing the downstream LCA to crash. So the vsearch memory is not the problem, but it might include the solution.


Now let’s talk about vsearch.

Remember that before performing optional alignments, vsearch first sorts possible hits by shared kmers. --maxaccepts 0 will perform optional alignments on ALL reads Royal Rumble style, totally destroying the advantage of the kmer prefilter. Passing --maxaccepts 0 --maxhits 1 will still exhaustively search the full database, but then it would report only the top hit. :1st_place_medal:

Yes, that’s what vsearch does by default! So once you have found a winner over your threshold, you can use --maxaccepts to control how many challengers you allow to fight the champ, and --maxhits to control the number of winning trophies you give out in the end. :1st_place_medal: :2nd_place_medal: :3rd_place_medal:

I absolutely think we should expose --maxhits. Having both --maxhits and --maxaccepts would let us increase the true positive rate without overwhelming the LCA with too many hits.


And now we have another solution for this problem!

And the current memory bottleneck is the LCA step, so in this case dividing the reads into smaller batches would probably solve this problem!

Great work guys. I’m glad we are finding success with vsearch.
Colin

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.