MemoryError When Training UNITE ver7 01.12.2017 Classifier

Sydney_Morgan · January 28, 2018, 2:24am

I am trying to train a UNITE ver7 01.12.2017 classifier and have two questions about the training. I am following a modified version of the protocol posted by Greg Caporaso on GitHub (gregcaporaso/2017.06.23-q2-fungal-tutorial, a quick fungal ITS analysis tutorial for QIIME 2).

First, why does the "feature-classifier extract-reads" command, which takes the specific primers used for your samples, not seem to be necessary for fungal ITS data? It was not included in the tutorial I followed and I still got good results (species-level identification for many features), but I know this step is important for 16S data. Is there a difference between the two databases that makes this command necessary for bacterial identification but not for fungal identification?

My second question concerns an error I am currently getting when training the UNITE classifier with the "fit-classifier-naive-bayes" command. I have run this command successfully before, but I was using Docker then and am now using VirtualBox, so I wonder whether that is the problem. The command I am using is as follows:

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads unite-ver7-dynamic-seqs-01.12.2017.qza \
  --i-reference-taxonomy unite-ver7-dynamic-tax-01.12.2017.qza \
  --o-classifier unite-ver7-dynamic-classifier-01.12.2017.qza \
  --p-classify--chunk-size 20000 \
  --verbose

I added --p-classify--chunk-size 20000 based on the recommendations in a thread about a similar error someone received when training the Silva classifier (I also tried a chunk size of 10000), but it did not fix the problem. The error is as follows:

/home/qiime2/miniconda/envs/qiime2-2017.12/lib/python3.5/site-packages/q2_feature_classifier/classifier.py:101: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.19.1. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
  warnings.warn(warning, UserWarning)
Traceback (most recent call last):
  File "/home/qiime2/miniconda/envs/qiime2-2017.12/lib/python3.5/site-packages/q2cli/commands.py", line 224, in __call__
    results = action(**arguments)
  File "", line 2, in fit_classifier_naive_bayes
  File "/home/qiime2/miniconda/envs/qiime2-2017.12/lib/python3.5/site-packages/qiime2/sdk/action.py", line 228, in bound_callable
    output_types, provenance)
  File "/home/qiime2/miniconda/envs/qiime2-2017.12/lib/python3.5/site-packages/qiime2/sdk/action.py", line 363, in callable_executor
    output_views = self._callable(**view_args)
  File "/home/qiime2/miniconda/envs/qiime2-2017.12/lib/python3.5/site-packages/q2_feature_classifier/classifier.py", line 310, in generic_fitter
    pipeline)
  File "/home/qiime2/miniconda/envs/qiime2-2017.12/lib/python3.5/site-packages/q2_feature_classifier/_skl.py", line 32, in fit_pipeline
    pipeline.fit(X, y)
  File "/home/qiime2/miniconda/envs/qiime2-2017.12/lib/python3.5/site-packages/sklearn/pipeline.py", line 250, in fit
    self._final_estimator.fit(Xt, y, **fit_params)
  File "/home/qiime2/miniconda/envs/qiime2-2017.12/lib/python3.5/site-packages/q2_feature_classifier/custom.py", line 41, in fit
    classes=classes)
  File "/home/qiime2/miniconda/envs/qiime2-2017.12/lib/python3.5/site-packages/sklearn/naive_bayes.py", line 555, in partial_fit
    self._update_feature_log_prob(alpha)
  File "/home/qiime2/miniconda/envs/qiime2-2017.12/lib/python3.5/site-packages/sklearn/naive_bayes.py", line 717, in _update_feature_log_prob
    self.feature_log_prob_ = (np.log(smoothed_fc) -
MemoryError

Also, based on a recommendation from the same Silva thread, I ran the command ulimit -a, which produced the following:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 15632
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 15632
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Is there anything else I can try, or do I need to find a new machine to run this command?
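
For anyone who wants to check the same per-process limits from inside the Python environment that runs QIIME 2, here is a minimal sketch using the standard-library `resource` module (Unix only). The helper names `describe_limit` and `memory_limits` are made up for this example, and the values are reported in bytes, whereas `ulimit` prints kbytes. If everything comes back "unlimited", as in the output above, the shell is probably not what is capping memory, and the machine's (or VM's) physical memory is the more likely limit.

```python
import resource


def describe_limit(value):
    """Render a raw rlimit value the way `ulimit` would label it."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def memory_limits():
    """Return (soft, hard) limits for the memory-related resources.

    The labels mirror the matching `ulimit -a` rows; the values are in
    bytes (ulimit shows kbytes).
    """
    resources = {
        "virtual memory (-v)": resource.RLIMIT_AS,
        "data seg size (-d)": resource.RLIMIT_DATA,
        "stack size (-s)": resource.RLIMIT_STACK,
    }
    limits = {}
    for label, res in resources.items():
        soft, hard = resource.getrlimit(res)
        limits[label] = (describe_limit(soft), describe_limit(hard))
    return limits


for label, (soft, hard) in memory_limits().items():
    print("{}: soft={} hard={}".format(label, soft, hard))
```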

Nicholas_Bokulich · January 28, 2018, 2:36am

Hi @Sydney_Morgan,
Thanks for posting!

For bacterial 16S rRNA reads, we see a performance boost when the feature classifier is trained on extracted sequence reads, compared to the near-full-length 16S rRNA gene sequences. For fungal ITS reads, we see a performance decrease upon extraction.

This is primarily because the reference database is composed of ITS sequences amplified with a range of different primers, so the sequences do not overlap completely. Depending on the primers you choose, many of the reference sequences will fail to extract simply because the primer sequence is absent from those particular reference sequences, not necessarily because the primer does not amplify that species. It also does not help that UNITE trims its sequences to remove flanking rRNA gene regions (which contain primer sites); you must use the "developer" version of the database to retrieve the full-length reads, which still suffer from the issue described above.
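
To make the extraction problem concrete, here is a toy Python sketch. It assumes exact, non-degenerate primer matching, and the primers, sequences, and the helper `has_primer_sites` are all made up for illustration; it is not how `extract-reads` is implemented. It only shows why a reference that lacks a primer site drops out of the training set regardless of which species it represents.

```python
def has_primer_sites(seq, fwd_primer, rev_primer_rc):
    """Return True if the forward primer site is found in seq and the
    reverse primer site (given as it reads on the forward strand) is
    found somewhere after it."""
    start = seq.find(fwd_primer)
    if start == -1:
        return False
    end = seq.find(rev_primer_rc, start + len(fwd_primer))
    return end != -1


# Hypothetical primers, not real ITS primers.
FWD = "ACGTAC"
REV_RC = "GGATCC"

# Hypothetical reference sequences.
references = {
    "full_length": "TTACGTACAAAAAAGGATCCTT",   # both primer sites present
    "trimmed": "AAAAAA",                       # flanking regions removed
    "other_primer": "TTACGTACAAAAAACCCGGG",    # amplified with another reverse primer
}

# Only references containing both sites would survive extraction.
retained = [name for name, seq in references.items()
            if has_primer_sites(seq, FWD, REV_RC)]
print(retained)
```

In this toy set only one of the three references is retained, even though all three could come from taxa that the primers amplify in practice.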

How much memory are you allocating to the virtual machine? If you can allocate more, do. In my experience, it does not take much memory (< 8 GB) to train a UNITE classifier. An even lower chunk size may help; a different machine may be the last resort if all else fails. You could also check out these previous forum posts (here and here) to see if others have offered additional solutions.
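
As a rough way to judge whether a given allocation is enough, the sketch below estimates the size of one dense float64 array with one row per taxonomy class and one column per hashed k-mer feature. That is the shape of the `feature_log_prob_` array being built on the line where the traceback stops, as far as I understand scikit-learn's multinomial naive Bayes. The class and feature counts here are placeholders, not measured values for UNITE, and the helper names are made up. Several arrays of this shape may exist at the same moment during that step, so treat the result as a lower bound.

```python
def dense_float64_bytes(n_classes, n_features):
    """Bytes needed for one dense (n_classes x n_features) float64 array."""
    return n_classes * n_features * 8


def to_gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


# Placeholder values; substitute the number of unique taxonomy strings in
# your reference and the feature count your classifier is configured with.
n_classes = 20000
n_features = 8192

one_array = dense_float64_bytes(n_classes, n_features)
print("{:.2f} GiB per array".format(to_gib(one_array)))
```

Because this array's size depends on the number of classes and features rather than on how many reads are processed per chunk, a smaller chunk size may not be enough on its own when the failure occurs at this step.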

I hope that helps! Good luck!

Sydney_Morgan · January 31, 2018, 1:34am

Thank you for the information and the help! I have allocated 4 GB to the virtual machine and can't allocate much more, so that is likely the sticking point. I ran the command on a more powerful machine and it worked with no hiccups, so I will continue to do that when I need to train classifiers in the future.

Nicholas_Bokulich · January 31, 2018, 1:37am

Excellent! The good news is that training the classifier only needs to happen rarely (e.g., when switching marker genes or reference databases), certainly not every analysis.

In the future, pre-trained UNITE classifiers could conceivably also be released on the UNITE or QIIME 2 websites, but I certainly do not want to make any promises...

Happy QIIMEing
