I am having the same problem as several other people who have posted about this topic. I only have four samples with a maximum of 3000 sequences each. I added --p-pre-dispatch 1 and --p-reads-per-batch 500 to my classify-sklearn command, and the job always ends with "MemoryError". I finally tried AWS, but it died the same way on the instance (I am limited to the free tier, and I guess those instances don't have enough memory for this process either).
My main question is, is there any way to use SILVA to assign taxonomy in QIIME2 without the pre-trained classifier and the "classify-sklearn" command? Just hoping there has been a new development in the ~1 month since this issue was last raised on the forum.
Hi @Nastassia_Patin,
The short answer is no, I do not believe there have been any changes to q2-feature-classifier that reduce the memory required for training very large classifiers (e.g., with SILVA). I do have several tips, however, to help you accomplish what you are trying to do.
What marker gene / domain are you attempting to use? There is a full-length trained SILVA classifier available on the QIIME 2 data resources page. I do not know who trained this classifier or whether it is only 16S or 16S + 18S (either way I am almost certain it would include 16S). @BenKaehler or @gregcaporaso would probably know more about this.
Using the full-length classifier would be fully acceptable, in which case you do not need to take the trouble to train your own classifier. I have benchmarked classifier accuracy using full-length and trimmed classifiers for the Greengenes and SILVA reference databases. Trimming does slightly increase species-level classification accuracy (as reported previously, hence our default recommendation to trim); however, the gains are not dramatic, so full-length classifiers should be fine for most users.
There may be other ways to reduce the memory requirements for training a SILVA classifier. Make sure you are only training a 16S or 18S classifier, not 16S + 18S, to significantly decrease the number of input sequences and classes. Use OTU pre-clustered sequences instead of the full sequence set for training; if you are already trying, e.g., 99% OTUs, you could try 97% or lower to further reduce the memory load.
Thanks for the quick reply, Nicholas! I should have clarified the problem: I am actually using one of the pre-trained classifiers already, not training my own. Sorry the title of this thread is misleading, but the original thread discussing a pre-trained SILVA classifier was closed. The command I am running is:
I've tried both the full-length and trimmed pre-trained classifiers but run into the same memory problem with each. I have a very small data set, which is why I'm so confused! I ran the assignment in QIIME1 and it worked fine. I checked my version of scikit-learn and it is 0.19.0. Any thoughts?
It looks like others have successfully used the SILVA classifier with around 20-32 GB RAM, with the exception of this post where the user has a very large data set. Given the size of your data set, I'd expect a lower memory need.
So I should ask: how much RAM do you have? The SILVA classifier does eat up a lot of memory and a standard laptop (e.g., around 8 GB) is probably not enough. It looks like the AWS free tier only has 1 GB of RAM (though perhaps I'm reading the wrong info), which would be woefully inadequate.
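A minimal sketch for checking how much RAM the machine (or virtual machine) actually exposes, assuming a Linux system where /proc/meminfo is available. The helper names parse_meminfo and kib_to_gib are hypothetical and only for illustration:

```python
# Minimal sketch: report total and available RAM on a Linux machine.
# Assumption: /proc/meminfo exists (Linux only). The helper names are
# hypothetical and are not part of QIIME 2 or any other library.
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in KiB}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only lines of the form "Name:   <integer> [kB]".
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def kib_to_gib(kib):
    """Convert kibibytes to gibibytes."""
    return kib / (1024 ** 2)


meminfo_path = Path("/proc/meminfo")
if meminfo_path.exists():
    mem = parse_meminfo(meminfo_path.read_text())
    print(f"Total RAM:     {kib_to_gib(mem.get('MemTotal', 0)):.1f} GiB")
    print(f"Available RAM: {kib_to_gib(mem.get('MemAvailable', 0)):.1f} GiB")
else:
    print("/proc/meminfo not found; this sketch only works on Linux.")
```

Running this inside the virtual machine (rather than on the host) shows what the guest operating system really has to work with, which is the number that matters for classify-sklearn.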
If you are limited by memory availability, I can suggest some alternatives:
Use the Greengenes pre-trained classifiers instead. They are smaller and less memory-intensive (should work fine on most standard laptops).
Use the BLAST or VSEARCH consensus classifiers available in q2-feature-classifier instead. They do not perform quite as well as the naive Bayes classifier used in classify-sklearn, but they still perform very well with appropriate parameter settings. They should also require less memory (I'm not 100% positive, but it is worth a try).
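As a rough sketch of the VSEARCH option, the snippet below assembles a classify-consensus-vsearch command line. All file names (rep-seqs.qza, ref-seqs.qza, ref-taxonomy.qza) and the output directory are placeholders, not files from this thread, and the reference reads and taxonomy are assumed to have been imported as QIIME 2 artifacts already. The helper consensus_vsearch_cmd is hypothetical; it only builds and prints the command and does not run it:

```python
# Minimal sketch: build (but do not run) a classify-consensus-vsearch command.
# Assumption: all file names below are placeholders for your own artifacts.
import shlex


def consensus_vsearch_cmd(query, reference_reads, reference_taxonomy, output_dir):
    """Return the command as an argument list suitable for subprocess.run."""
    return [
        "qiime", "feature-classifier", "classify-consensus-vsearch",
        "--i-query", query,
        "--i-reference-reads", reference_reads,
        "--i-reference-taxonomy", reference_taxonomy,
        # --output-dir collects every output artifact in one directory;
        # QIIME 2 expects that this directory does not exist yet.
        "--output-dir", output_dir,
    ]


cmd = consensus_vsearch_cmd(
    "rep-seqs.qza", "ref-seqs.qza", "ref-taxonomy.qza", "vsearch-taxonomy"
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run(cmd, check=True) once the placeholder paths point at real files.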
This is super helpful, thank you very much! I have less than 1 GB of RAM on my Linux virtual machine so I will have to look for solutions. In the meantime, thanks for those alternatives, I will give them a shot!
Hi @Nastassia_Patin! If you're using VirtualBox, check out the qiime2 VirtualBox guide; Step 4 shows how to increase the RAM and CPUs available to the guest operating system (it's in the screenshot with "Appliance Settings"). The default values usually do not provide enough resources to perform taxonomic classification.
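For an appliance that has already been imported, the same change can also be made from the host's command line with VBoxManage instead of the import dialog. This is only a sketch: the VM name "QIIME 2 Core" and the values 8192 MB and 2 CPUs are made-up examples, vbox_resize_cmd is a hypothetical helper, and the VM needs to be powered off before modifyvm is applied:

```python
# Minimal sketch: build (but do not run) a VBoxManage command that raises
# the RAM and CPU count of a powered-off VirtualBox VM.
# Assumption: the VM name and the numbers used below are examples only.
import shlex


def vbox_resize_cmd(vm_name, memory_mb, cpus):
    """Return a VBoxManage modifyvm command as an argument list."""
    if memory_mb <= 0 or cpus <= 0:
        raise ValueError("memory_mb and cpus must be positive")
    return [
        "VBoxManage", "modifyvm", vm_name,
        "--memory", str(memory_mb),  # guest RAM in megabytes
        "--cpus", str(cpus),         # number of virtual CPUs
    ]


cmd = vbox_resize_cmd("QIIME 2 Core", 8192, 2)
print(shlex.join(cmd))
```

Running VBoxManage list vms on the host shows the actual VM names to substitute for the example name.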
Thanks Jai! I hadn't thought about using the qiime2 VirtualBox. My BioLinux VirtualBox is limited in the amount of RAM I can provide it (although I mistyped earlier, it now has 10000 MB or 10 GB) so I imagine the same problem would be the case for the qiime2 VB.
I would have thought 10 GB would be enough for a small dataset, but no dice! Thanks for the suggestion, though.