VSEARCH clustering memory estimation

I’m working with 625 MiSeq samples & trying to use vsearch to cluster against the Silva 132 16S database. The machine has ~30 GB RAM + 55 GB of virtual memory (swap), & the process is running on a 2nd mounted 4 TB drive (instead of on the main SSD, which is largely handed over to swap). The process chugged along for 9 days (initially using all 8 cores), then abruptly stopped. The kernel log shows:
[755528.607932] Out of memory: Kill process 10717 (qiime) score 929 or sacrifice child
[755528.607934] Killed process 10717 (qiime) total-vm:82150232kB, anon-rss:28305832kB, file-rss:16kB, shmem-rss:0kB

The input file has already been dereplicated. I could cluster against a smaller database (Greengenes), or I could go back & split the sequences into 2 smaller sets (they represent water samples & biofilm samples). Obviously, neither is my 1st choice, but they may be necessary. Any suggestions are welcome, & thanks for the great help you have already provided!

Linda

Hi @Ldrhodes,

Are you doing closed or open reference clustering? Splitting will be fine with closed-ref, but can’t be done with open-reference.

Given that you ran out of memory with 30 GB, I assume you are doing open-reference? There may not be an alternative other than using a machine with more memory. :frowning:

If you processed your data with DADA2 or Deblur prior to this, I would argue that closed-reference is sufficient as you can treat your ASVs as “de-novo” clusters all by themselves, but that obviously only works if you have ASVs and not just raw reads.

Hi Evan,
Yes, it was open-reference clustering, as we used other methods for the earlier processing. There was actually ~80 GB available (30 GB of RAM & >50 GB of swap), but maybe only RAM is relevant for this process. I’m only now figuring out where to look for intermediate files to figure out how far along the process got.
I (maybe naively) started re-running against the Greengenes 13_8 db, just to see what happens - maybe the memory demand is linked to the # of sequences rather than to db size. And I will go back & split the sequences into water vs biofilms. Thank goodness the other datasets are not as big as this one!
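(For anyone else hitting this: my understanding is that QIIME 2 writes its working files to the system temp directory, so I can point TMPDIR at the 4 TB drive & add --verbose to the clustering command to watch vsearch’s progress - just a sketch, the path is a placeholder:)

```
# Keep QIIME 2's temporary/intermediate files on the large secondary drive
# (path is a placeholder), then re-run the clustering command with --verbose
# so the underlying vsearch progress log is printed to the terminal.
export TMPDIR=/mnt/data/qiime2-tmp
mkdir -p "$TMPDIR"
```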

Hey @Ldrhodes,

What kind of questions are you trying to answer with this data? And what kind of earlier processing did you perform? It may be the case that closed-reference would be sufficient, and that shouldn’t have the same memory issues.

Hi Evan,
We have sequences from the hyporheic zone of urban streams with restored vs unrestored reaches, plus reaches in a more “pristine” watershed (not messed with for > 100 years). Nested within multiple years of monitoring the hyporheic water in the restored reaches is an inoculation experiment in the 1st year post-restoration. That experiment involved transplanting hyporheic substrate from the “pristine” watershed to one of the restored reaches - these are the biofilm samples. (This is where we could split the water from biofilm samples, if reducing sample # would help with memory issues.) We anticipate that there may be many novel or previously undetected taxa (plus the known nasties of an urban environment), which is why I chose to go the open reference route.

That sounds really cool!

Did you use a denoising approach for your initial read processing? Either DADA2 or Deblur?

If so, I would still use closed-reference, because the q2-vsearch plugin will report which ASVs were not mapped. You can then still do normal analysis with that data (or even merge it with your closed-reference OTUs); those ASVs just won’t hit a specific database (but that would have been the case with open-reference anyhow).

So unless you really need to perform de-novo OTU clustering, you could save a lot of computation time by skipping that step of the open-reference pipeline (leaving you with just closed-reference).
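For reference, a closed-reference run is a single q2-vsearch action, roughly like this (the filenames are placeholders for your own artifacts, & the identity/thread values are just examples):

```
# Closed-reference OTU clustering with q2-vsearch (placeholder filenames).
# The unmatched-sequences output holds the ASVs/reads that did not hit the
# reference; you can keep those & analyze them alongside the clustered OTUs.
qiime vsearch cluster-features-closed-reference \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences silva-132-ref-seqs.qza \
  --p-perc-identity 0.97 \
  --p-threads 8 \
  --o-clustered-table table-cr.qza \
  --o-clustered-sequences rep-seqs-cr.qza \
  --o-unmatched-sequences unmatched-seqs.qza
```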

And unfortunately, splitting your data between water and biofilm will mean that reads which might have been the same/comparable will end up in different OTUs, unless you get lucky with the cluster centroids during the de-novo OTU clustering step. This would not be the case with closed-reference + using the leftover ASVs directly.

Yes, we used Trimmomatic for adapter removal, quality-score filtering, short-read removal, & trimming of single-sequence reads, then joined the pairs with Pandaseq. An additional filtering step removed pairs with runs of Ns or homopolymeric runs, or pairs < 400 bp, & then we dereplicated. In an earlier attempt, we found that chimera removal before clustering took much longer than after clustering, so we re-arranged the order. Perhaps removing chimeras would help.
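For context, the length/N filtering & dereplication were along these lines if done with standalone vsearch (just a sketch of the idea, not our exact commands; filenames are placeholders, & the homopolymer screen was a separate step):

```
# Sketch only (placeholder filenames): drop merged pairs containing any Ns or
# shorter than 400 bp, then dereplicate with abundance (;size=) annotations.
vsearch --fastx_filter joined.fastq \
  --fastq_maxns 0 \
  --fastq_minlen 400 \
  --fastaout filtered.fasta

vsearch --derep_fulllength filtered.fasta \
  --sizeout \
  --output derep.fasta
```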

I went ahead & tried running the closed reference clustering option. The good news is that it got to the kill step much more quickly (2 days instead of 9 days), but it still hit a memory problem.

[2027665.547998] Out of memory: Kill process 14386 (qiime) score 880 or sacrifice child
[2027665.548001] Killed process 14386 (qiime) total-vm:77769064kB, anon-rss:30164192kB, file-rss:56kB, shmem-rss:0kB

In all attempts (open vs closed; Silva 132 vs Greengenes 13_8), the final step before the kill seems to be writing FASTA files. Looking at the QIIME 2 log file, it seems the last temporary file written before the FASTA-writing step contained 32,540,223 sequences. One FASTA file (16 GB) was successfully written, but the kill happened before the 2nd FASTA file could be written. Maybe I can ask for more swap & give it another try…
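(If more swap is the answer, I’m thinking of temporarily adding a swap file on the 4 TB drive, something like this - the size & path are placeholders:)

```
# Add a temporary swap file on the big secondary drive (size & path are placeholders).
sudo fallocate -l 64G /mnt/data/swapfile   # use dd instead if fallocate isn't supported for swap here
sudo chmod 600 /mnt/data/swapfile
sudo mkswap /mnt/data/swapfile
sudo swapon /mnt/data/swapfile
free -h   # confirm the extra swap shows up
```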

Thanks for your encouragement & advice about not splitting the water & biofilm samples - I’ve really resisted doing it so far.

I think anything that decreases the size of the input would help. Chimera removal comes after OTU clustering in QIIME 2, but you could do this externally (e.g., remove chimeras using standalone VSEARCH).
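For the standalone route, a reference-based check against your Silva FASTA would look roughly like this (filenames are placeholders; de-novo checking with --uchime_denovo would instead need the ;size= abundance annotations from dereplication):

```
# Reference-based chimera removal with standalone VSEARCH (placeholder filenames).
# Keep the non-chimeric sequences & feed those into clustering.
vsearch --uchime_ref derep-seqs.fasta \
  --db silva_132_16S.fasta \
  --nonchimeras derep-seqs.nonchimeric.fasta \
  --chimeras derep-seqs.chimeric.fasta \
  --threads 8
```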

Could you please post the exact command that you are using?

I would expect sizable differences in memory demands between these databases, so it is a little surprising that they both get killed at the same point. It may be worth testing on a small dataset just as a sanity check. It sounds like this is just a memory issue with large inputs/outputs rather than a system problem, but it doesn’t hurt to check.
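For that sanity check, you could subsample the dereplicated sequences down to something tiny & confirm that the same command runs end to end, e.g. with vsearch’s subsampling (a sketch; filenames & numbers are placeholders):

```
# Pull a small random subsample just to confirm the pipeline completes (placeholder filenames).
vsearch --fastx_subsample derep-seqs.fasta \
  --sample_size 10000 \
  --randseed 42 \
  --fastaout test-subset.fasta
```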

Sorry this is dragging along!

Many thanks for your suggestions. It finally got through the process with a total of ~90 GB of memory. This is the largest dataset I’ll put through, so hopefully we’re over the hill & onto the fun stuff. You all are invaluable!

