Unfortunately, whether I go through the process of importing the taxonomy and reference sequence data from Silva 138 myself or use the two files already processed, with the 515/806 primer set already extracted (from "Data resources" in the QIIME 2 2021.4.0 documentation), my computer suffers a memory error and cannot continue after several hours, despite my dedicating 30 GB of RAM and 6 CPUs to the process. Those settings were sufficient for creating a classifier from the Silva 132 database. To be clear, I'm using this code:
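(The exact command wasn't included in the post; a typical training invocation for the Silva 138 V4 classifier would look something like the following. The artifact filenames match those on the QIIME 2 data resources page, but substitute your own if they differ.)

```shell
# Train a Naive Bayes taxonomy classifier on the Silva 138 reference reads
# already extracted for the 515F/806R (V4) primer region.
# Filenames are placeholders based on the QIIME 2 data resources page.
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-138-99-seqs-515-806.qza \
  --i-reference-taxonomy silva-138-99-tax-515-806.qza \
  --o-classifier silva-138-99-515-806-nb-classifier.qza
```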
How much RAM is your VirtualBox VM set up to use? Have you allocated sufficient RAM?
I've been able to construct and train the V4 (515-806) classifier on my laptop with 16 GB RAM... though that was indeed pushing it! But it also depends on what else your system is doing at the time.
Again, my first thought would be to make sure your VirtualBox VM has access to at least 16-24 GB RAM when it is running. The default might only be 2-8 GB.
Thank you for responding so quickly. I've allocated as much RAM to 2021.4 as our computer will allow (30.3 GB), and nothing aside from VirtualBox and QIIME 2 is open, which is why I thought this problem might be specific to me. But as I said, these settings were sufficient to create a classifier with version 132 of the Silva database, so I don't know why this one is failing.
Worst-case scenario, I can just use the pre-trained classifier available on the QIIME 2 website... it's just that I've been told it's always best to make your own.
I don't think it's specific to you. For reference, when we re-train the feature classifiers for new QIIME 2 releases, we have to use 64 GB of RAM on our HPC, with the default "chunk size" setting of 20,000. 30 GB seems insufficient to me to get the job done, especially if you're observing a memory error. One option might be to cut the `--p-classify--chunk-size` parameter in half, reducing the memory burden (but roughly doubling the runtime).
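To illustrate the suggestion above, here is what halving the chunk size would look like in the training command (filenames are placeholders based on the QIIME 2 data resources page):

```shell
# Halve the default classify chunk size (20,000 -> 10,000) to lower
# peak memory use during training, at the cost of roughly 2x runtime.
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-138-99-seqs-515-806.qza \
  --i-reference-taxonomy silva-138-99-tax-515-806.qza \
  --p-classify--chunk-size 10000 \
  --o-classifier silva-138-99-515-806-nb-classifier.qza
```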
Thank you for clarifying. 64 GB of RAM is more than a lot of people have access to (at least my advisor's lab doesn't). At the very least, the tutorial should state how much RAM this process requires, especially since it's the only way to get the latest Silva databases, and that the `--p-classify--chunk-size` parameter is necessary if your computer doesn't have that much RAM.
I'm not certain whether it would be better to use the pre-made classifier online or to reduce `--p-classify--chunk-size`, but I'll stick with the former option for now.
The RAM requirements depend entirely on the reference database you use to generate the classifier (and, to complicate matters, any trimming/extraction will affect that as well). Training on a Greengenes DB with just a few GB of RAM is common. Unfortunately, it's not a "one size fits all" situation.
If you're using the 515f-806r primers (and judging by your first post, it sounds like you are), then using a pre-trained classifier will be identical to training your own, assuming you weren't applying some kind of intermediate filtering or cleanup. We usually recommend folks train their own classifier to accommodate unique environments, custom databases, or their specific primers. It sounds like you're using a pretty common setup and can confidently use the pre-trained classifier.
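For completeness, applying the pre-trained classifier downloaded from the data resources page would look something like this (the representative-sequences filename is a placeholder for your own artifact):

```shell
# Assign taxonomy to your representative sequences using the
# pre-trained Silva 138 V4 classifier from the QIIME 2 website.
qiime feature-classifier classify-sklearn \
  --i-classifier silva-138-99-515-806-nb-classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza
```

Note that classification with a pre-trained classifier still needs substantial RAM, though typically less than training one.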