Hi Everyone, sorry for the slow response.
Others have had issues with the silva classifier. My last response on the issue is here.
In summary, when I run either of the pre-trained silva classifiers on my laptop, they use a maximum of around 11 GB of memory with a minimal number of rep seqs. One way to bring the memory usage down is to reduce the chunk size with the --p-chunk-size parameter; starting with a chunk size of 1,000 is a reasonable first step in debugging this problem. (And yes, changing --p-n-jobs from its default of 1 will cause multiple copies of the classifier artifact to be loaded into memory.)
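For example, a run along these lines should reduce the per-batch memory footprint (just a sketch; the artifact filenames are placeholders for your own files, and the exact parameter name may differ between QIIME 2 releases):

```bash
# Classify rep seqs with a pre-trained silva classifier, processing
# 1,000 sequences per chunk and a single job to keep peak memory down.
qiime feature-classifier classify-sklearn \
  --i-classifier silva-classifier.qza \
  --i-reads rep-seqs.qza \
  --p-chunk-size 1000 \
  --p-n-jobs 1 \
  --o-classification taxonomy.qza
```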
@ebolyen, is it possible that cluster users are running into problems if large artifacts are getting unzipped into temp directories that are mounted over a network? Setting TMPDIR (or whatever the variable is in the user's cluster environment) to scratch space that is local to the worker nodes may help.
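In a batch script that might look something like this (a sketch; the scratch path is a placeholder, and the right location depends on your cluster):

```bash
# Point QIIME's temp space at node-local scratch instead of a network mount.
# /scratch/$USER is a placeholder; substitute your cluster's local scratch path.
export TMPDIR=/scratch/$USER/qiime2-tmp
mkdir -p "$TMPDIR"
# ...then run the classify-sklearn command as usual.
```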
Just by way of explanation, the silva database has always been troublesome, partly because of the large number of unique taxonomies (82,325 in the taxonomy file I'm looking at).
@yoshiki, are you still waiting for your 20,000 sequences to finish? If you are able to share them with me, I'm happy to debug.