feature classifier memory requirements

Hello,

I am following the RESCRIPt tutorial and want to train my own classifiers with SILVA for specific regions such as V1-V2, V3-V4, etc.

I have 16 GB of RAM and 32 GB of swap, and I still get a memory error.

I have seen some arguments about memory requirements for this step, and I understand that it is not precise.

I’m willing to buy more memory, but I need to know how much more I need. I know it cannot be precise, but there must be a rough estimate.

Thank you…

Hi @the_dummy,

Thanks for giving RESCRIPt a spin!

You could check out the reads-per-batch parameter to reduce memory use. Multithreading increases memory demand, so keep the number of jobs (threads) at 1 if memory is the limiting factor.

For SILVA we’ve heard a bunch of different reports… e.g., some have reported needing 32GB, others (with the advice I’ve given above) can squeeze by with less!

One thing is for sure — if you use RESCRIPt to grab the SILVA data and use the --p-no-include-species-labels option, it greatly reduces runtime and memory use, since the number of classes is greatly reduced.
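For reference, a minimal sketch of that download step is below. The function name and output file names are hypothetical, and the SILVA version ('138.1') and target ('SSURef_NR99') are assumptions chosen for illustration; adjust them to the release you actually want.

```shell
# Sketch: download SILVA with RESCRIPt, leaving out species labels.
# $1 = output sequences artifact, $2 = output taxonomy artifact (hypothetical names).
get_silva_without_species() {
  # Version and target below are assumptions, not values from this thread.
  qiime rescript get-silva-data \
    --p-version '138.1' \
    --p-target 'SSURef_NR99' \
    --p-no-include-species-labels \
    --o-silva-sequences "$1" \
    --o-silva-taxonomy "$2"
}
```

It could be called as `get_silva_without_species silva-seqs.qza silva-tax.qza`.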

Good luck!

I will try that, thank you.

That parameter belongs to qiime feature-classifier classify-sklearn, doesn’t it? I haven’t trained my classifier successfully yet.

I’m experimenting with --p-classify--chunk-size and will share my progress.

Is there any documentation that explains the parameters of qiime feature-classifier fit-classifier-naive-bayes? If there are more parameters related to memory usage, I would like to know.

I think the feature-classifier tutorial needs to be improved; we need more insight into this plugin.

Sincerely.

Since you mentioned RESCRIPt, I assumed you were using evaluate-fit-classifier to fit and test.

None of the other parameters will impact memory usage.

SILVA tends to be a memory hog… reducing database size (e.g., by filtering out low-quality sequences as described in the RESCRIPt tutorial) and reducing label complexity (e.g., dropping species labels as I described above) are the only real ways to reduce memory demand during training. Chunk size might help, but in my experience it does not.

Good luck!

YES! Dropping the species labels really makes a difference. After following the steps in the RESCRIPt tutorial and trimming the database to the V3-V4 region, I successfully trained my classifiers very quickly. Still, I continued my experiments with the database that includes species labels.

The numbers below are for the database that includes species labels.

evaluate-fit-classifier with the --p-reads-per-batch 10000 and --p-n-jobs 1 parameters ran for 32 hours. Memory usage almost never went over 16 GB and usually stayed around 8 GB.
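A sketch of what that run might look like is below. Only the two parameter values come from my run; the function name and all file names are hypothetical placeholders.

```shell
# Sketch: fit and evaluate in one step with low-memory settings.
# $1 = reference sequences, $2 = reference taxonomy (hypothetical artifact names).
fit_and_evaluate_low_memory() {
  qiime rescript evaluate-fit-classifier \
    --i-sequences "$1" \
    --i-taxonomy "$2" \
    --p-reads-per-batch 10000 \
    --p-n-jobs 1 \
    --o-classifier classifier.qza \
    --o-evaluation classifier-evaluation.qzv \
    --o-observed-taxonomy observed-taxonomy.qza
}
```

For example: `fit_and_evaluate_low_memory silva-seqs-v3v4.qza silva-tax.qza`.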

By contrast, fit-classifier-naive-bayes with the --p-classify--chunk-size 10000 parameter took 6 hours. Memory usage peaked at 42 GB and was usually around 30 GB.
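The fit-only run can be sketched as follows. The chunk size is the value I used; the function name and file names are hypothetical placeholders.

```shell
# Sketch: fit the naive Bayes classifier only, with no testing or evaluation.
# $1 = reference reads, $2 = reference taxonomy, $3 = output classifier (hypothetical names).
fit_only() {
  qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads "$1" \
    --i-reference-taxonomy "$2" \
    --p-classify--chunk-size 10000 \
    --o-classifier "$3"
}
```

For example: `fit_only silva-seqs-v3v4.qza silva-tax.qza classifier.qza`.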

The RESCRIPt tutorial says that evaluate-fit-classifier and fit-classifier-naive-bayes are almost the same, but they seem to require different amounts of memory and time. Why is that? Could you explain it in simple terms, please?

Thank you very much for sparing time to answer such simple questions in this topic.

The trained classifiers they output should be identical if the inputs are identical.

But that does not mean that they are the same or take the same amount of time to run… evaluate-fit-classifier fits the classifier, tests it on the same set of sequences, and then evaluates the accuracy of those classifications, so it has many additional steps to perform!

Testing the classifier in particular is the very time-consuming part (since you have many thousands of sequences!), so the times you quoted make sense given the inputs and parameters… if in doubt, try classifying the SILVA sequences with either of your new classifiers. It should take ~26 hr, i.e., roughly the 32 hr total minus the 6 hr spent on fitting alone.
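If you want to check this yourself, a rough sketch of the classification step is below. The function name and file names are hypothetical, and the batch and job settings simply mirror the ones used earlier in this thread.

```shell
# Sketch: classify the reference sequences with an already trained classifier.
# $1 = trained classifier, $2 = reads to classify, $3 = output taxonomy (hypothetical names).
classify_reference_reads() {
  qiime feature-classifier classify-sklearn \
    --i-classifier "$1" \
    --i-reads "$2" \
    --p-reads-per-batch 10000 \
    --p-n-jobs 1 \
    --o-classification "$3"
}
```

In bash, prefixing the call with `time` (e.g., `time classify_reference_reads classifier.qza silva-seqs-v3v4.qza classified.qza`) reports how long the step took.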
