How much RAM is needed for feature classification?

ctekellogg · December 13, 2017, 8:13pm

Is there a way to figure out how much RAM the Naive Bayes classifier would need to classify a 20 MB OTU/rep set fasta file? Similarly, how much RAM would be necessary for deblur or DADA2 to process a 25-30 GB fastq file? Just wondering if I can make this work on my laptop while trying to get my lab to emerge from the MacOS dark ages.

Nicholas_Bokulich · December 14, 2017, 2:32pm

Hi @ctekellogg,

No, there is not really a straightforward way to estimate, but I can offer a few tips to lower memory consumption.

The main factor driving memory use is the size of the reference sequence database, so using a smaller reference database (e.g., Greengenes rather than SILVA) and shorter reference sequences will reduce memory load.
See this post for some other tips (chunk-size is now called reads-per-batch).
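
To compare candidate reference databases before training, a rough size summary can help. The sketch below is only an illustration: `summarize_fasta` and `FastaSummary` are hypothetical names, not part of QIIME 2, and the total base count is a proxy for relative size, not a memory estimate.

```python
from typing import Iterable, NamedTuple


class FastaSummary(NamedTuple):
    n_sequences: int   # number of records (header lines)
    total_bases: int   # summed sequence length across all records
    max_length: int    # length of the longest single sequence


def summarize_fasta(lines: Iterable[str]) -> FastaSummary:
    """Count sequences and bases in FASTA-formatted lines."""
    n_sequences = 0
    total_bases = 0
    max_length = 0
    current_length = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            max_length = max(max_length, current_length)
            current_length = 0
            n_sequences += 1
        else:
            # Sequence lines may be wrapped, so accumulate per record.
            current_length += len(line)
            total_bases += len(line)
    # Close the final record.
    max_length = max(max_length, current_length)
    return FastaSummary(n_sequences, total_bases, max_length)
```

Passing an open file handle works, e.g. `summarize_fasta(open("reference.fasta"))`, where the file name is a placeholder. A database with fewer and shorter sequences should, per the point above, need less memory.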

@benjjneb and @wasade may be able to offer some advice on memory consumption with these methods.

I hope that helps!

BenKaehler · December 14, 2017, 7:24pm

Just in addition to @Nicholas_Bokulich's comments, if you're worried about memory, don't set n-jobs to anything other than one. Most of the memory usage is in loading the classifier object, and it gets loaded once per job, so memory use scales with n-jobs.
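
As a back-of-the-envelope version of that scaling, here is a minimal sketch. It assumes memory is dominated by the loaded classifier and ignores everything else; `estimate_classifier_memory_gb` is a hypothetical helper, and `classifier_gb` is a number you would have to measure yourself for your own classifier.

```python
def estimate_classifier_memory_gb(classifier_gb: float, n_jobs: int = 1) -> float:
    """Rough peak memory if each job loads its own copy of the classifier.

    classifier_gb: measured in-memory size of one loaded classifier
                   (an input you supply; this sketch cannot determine it).
    n_jobs:        number of parallel jobs, must be at least 1 here.
    """
    if n_jobs < 1:
        raise ValueError("this sketch only handles n_jobs >= 1")
    if classifier_gb < 0:
        raise ValueError("classifier_gb must be non-negative")
    # One classifier copy per job; other overheads are ignored.
    return classifier_gb * n_jobs
```

So whatever one job needs, four jobs would need roughly four times that, which is why staying at one job is the safe choice on a laptop.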

For what it's worth, it runs just fine using Greengenes on my MacOS laptop.

benjjneb · December 14, 2017, 7:48pm

The DADA2 plugin processes samples individually, so memory requirements should be nearly flat with increasing sample number, and will be driven instead by the "largest" sample you have (in terms of unique sequences, not raw reads). That is pretty dataset-specific, so it's hard to give you a one-size-fits-all answer, but I've generally found 16GB sufficient for just about anything. It never hurts to have more than enough memory, though!
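
To get a feel for which sample is "largest" in that sense, one option is to count unique read sequences per demultiplexed FASTQ file. This is a minimal sketch under a few assumptions: records are exactly four lines each (no wrapped sequences), files are plain or gzip-compressed, and the function names are hypothetical. Raw unique counts are only a rough proxy, since filtering and truncation happen before denoising.

```python
import gzip
from pathlib import Path
from typing import Iterable, Tuple


def count_unique_sequences(fastq_path: Path) -> int:
    """Count distinct read sequences in a 4-line-per-record FASTQ file."""
    opener = gzip.open if fastq_path.suffix == ".gz" else open
    unique = set()
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # In 4-line FASTQ, the sequence is the second line of each record.
            if line_number % 4 == 1:
                unique.add(line.strip())
    return len(unique)


def largest_sample(fastq_paths: Iterable[Path]) -> Tuple[Path, int]:
    """Return the (path, unique count) pair with the most unique sequences."""
    counts = {path: count_unique_sequences(path) for path in fastq_paths}
    return max(counts.items(), key=lambda item: item[1])
```

Note that the set of unique reads is itself held in memory, so this is best run one file at a time, as it does here.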

wasade · December 20, 2017, 5:28pm

Similar to @benjjneb's comment on DADA2, Deblur's memory profile will remain pretty flat over sample count and peak with the largest sample. I normally allocate 8GB per thread when executing (as that's effectively the default on our compute resource), and that was sufficient for deep studies such as Yatsunenko et al. 2012, which averaged over 1M reads per sample.
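
Turning that allocation into a thread budget is simple arithmetic. The sketch below assumes the 8GB-per-thread figure as a default; it is an allocation that worked in practice, not a measured minimum, and `max_threads_for_memory` is a hypothetical helper.

```python
def max_threads_for_memory(total_ram_gb: float, gb_per_thread: float = 8.0) -> int:
    """How many threads fit in total_ram_gb at a fixed per-thread allocation."""
    if gb_per_thread <= 0:
        raise ValueError("gb_per_thread must be positive")
    if total_ram_gb < 0:
        raise ValueError("total_ram_gb must be non-negative")
    # Floor division: a partial allocation does not count as a full thread.
    return int(total_ram_gb // gb_per_thread)
```

A result of 0 only means the machine has less than one full allocation; a single thread may still run fine on a smaller dataset.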

Best,
Daniel
