You could check out the reads-per-batch parameter to reduce memory use. Multithreading will increase memory demand, so the threads parameter is not a way to limit this — do not raise it for that purpose.
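The idea behind reads-per-batch is simply that only one batch of reads needs to sit in memory at a time, so peak memory is bounded by the batch size rather than the full input. A minimal stand-in sketch of that batching pattern in plain Python (not the actual QIIME 2 implementation):

```python
from itertools import islice

def batched(reads, batch_size):
    """Yield successive fixed-size batches so only one batch of reads
    is held in memory at a time (the idea behind reads-per-batch)."""
    it = iter(reads)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Toy run: 10 "reads" processed in batches of 4 -> batch sizes 4, 4, 2.
sizes = [len(b) for b in batched(range(10), 4)]
print(sizes)  # [4, 4, 2]
```

Smaller batches trade a little throughput for a lower memory ceiling, which is why this parameter helps when RAM, not time, is the constraint.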
For SILVA we’ve heard a bunch of different reports… e.g., some have reported needing 32GB, others (with the advice I’ve given above) can squeeze by with less!
One thing is for sure — if you use RESCRIPt to grab the SILVA data and use the --p-no-include-species-labels option, it greatly reduces runtime and memory use, since the number of classes is greatly reduced.
Since you mentioned RESCRIPt, I assumed you were using evaluate-fit-classifier to fit and test.
None of the other parameters will impact memory usage.
SILVA tends to be a memory hog… reducing database size (e.g., by filtering out low-quality sequences as described in the RESCRIPt tutorial) and reducing label complexity (e.g., dropping species labels as I described above) are the only real ways to reduce memory demand during training. Chunk size might help in principle, but in my experience it does not.
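To see why label complexity matters so much: a multinomial naive Bayes model stores roughly an n_features × n_classes matrix of log-probabilities, so memory grows linearly with the number of distinct taxonomic labels. A back-of-the-envelope sketch (the class counts below are assumptions for illustration, not real SILVA figures):

```python
def nb_matrix_gib(n_features, n_classes, bytes_per_value=8):
    """Rough size in GiB of the feature-log-probability matrix a
    multinomial naive Bayes classifier stores (float64 values)."""
    return n_features * n_classes * bytes_per_value / 2**30

# Hypothetical numbers: an 8-mer feature space (~65k features) with an
# assumed species-level vs genus-level class count.
n_features = 4 ** 8          # 65,536 k-mer features
species_level = 50_000       # assumed class count with species labels
genus_level = 5_000          # assumed class count without them
print(round(nb_matrix_gib(n_features, species_level), 1))  # ~24.4 GiB
print(round(nb_matrix_gib(n_features, genus_level), 1))    # ~2.4 GiB
```

Even with made-up counts, the tenfold drop in classes translates directly into a tenfold drop in the model's core matrix, which is why dropping species labels shrinks both memory and runtime so dramatically.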
YES! Dropping the species labels really makes a difference. After following the steps in the RESCRIPt tutorial and trimming the database to the V3-V4 region, I trained my classifiers very quickly. Still, I continued my experiments with the database that includes species labels.
The numbers below are for the database with species labels included.
evaluate-fit-classifier with the --p-reads-per-batch 10000 and --p-n-jobs 1 parameters ran for 32 hours. Memory usage almost never exceeded 16 GB and usually stayed around 8 GB.
Whereas fit-classifier-naive-bayes with the --p-classify--chunk-size 10000 parameter took 6 hours. Maximum memory usage was 42 GB, and it usually used around 30 GB.
The RESCRIPt tutorial mentions that evaluate-fit-classifier and fit-classifier-naive-bayes are almost the same, but they seem to require very different amounts of memory and time. Why does this happen? Could you explain it in simple terms, please?
Thank you very much for taking the time to answer such simple questions in this topic.
The trained classifier output by them should be identical if the inputs are identical.
But that does not mean that they are the same action or take the same amount of time to run… evaluate-fit-classifier fits the classifier, tests it on the same set of sequences, and then evaluates the accuracy of those classifications, so it has many additional steps to perform!
Testing the classifier is the particularly time-consuming part (since you have many thousands of sequences!), so the times you quoted make sense given the inputs and parameters… if in doubt, try classifying the SILVA sequences with either of your new classifiers; it should take ~26 hr.
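The contrast between the two actions can be sketched schematically: fit-classifier-naive-bayes stops after training, while evaluate-fit-classifier also reclassifies every training sequence and scores the result. This toy stand-in (a memorizing "classifier", not the real naive Bayes implementation) just shows which steps each action performs:

```python
def fit(sequences, labels):
    # "Training": memorize sequence -> label (stand-in for naive Bayes).
    return dict(zip(sequences, labels))

def classify(model, sequences):
    # The slow extra step: predict a label for every input sequence.
    return [model[s] for s in sequences]

def evaluate(predicted, truth):
    # Accuracy of the self-classification.
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

seqs = ["ACGT", "TTGA", "CCGA"]
taxa = ["g__A", "g__B", "g__A"]

model = fit(seqs, taxa)        # fit-classifier-naive-bayes stops here
preds = classify(model, seqs)  # evaluate-fit-classifier also does this...
acc = evaluate(preds, taxa)    # ...and this
print(acc)                     # 1.0 (a memorizer is perfect on its own data)
```

With a real database the classify step runs over many thousands of sequences, which is exactly why evaluate-fit-classifier takes so much longer even though the trained classifier it outputs is identical.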