I thought I’d give an update on my ‘progress’ with this. Importing the taxonomy and extracting reads worked fine, but I’ve now run into a (potentially unrelated) issue with feature-classifier fit-classifier-naive-bayes, which I’m trying to work out. I’m basing my workflow on this tutorial.
Import RDP data into qza format
# rdp taxonomy strings
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path rdp_qiime_taxonomy2.txt \
--output-path rdp-ref-taxonomy.qza
# download and import KMaki's fasta file of RDP sequences
$ wget --no-check-certificate 'https://ndownloader.figshare.com/files/23146310?private_link=86d9b343729e4c67ea08' -O rep_set_99_rdp.fa
qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path rep_set_99_rdp.fa \
--output-path rdp_16S_otus.qza
Extract reads from the relevant region (V3-V5) out of the ref database
$ qiime feature-classifier extract-reads \
--i-sequences rdp_16S_otus.qza \
--p-f-primer CCTACGGGNGGCWGCAG \
--p-r-primer GACTACHVGGGTATCTAATCC \
--p-min-length 300 \
--p-max-length 600 \
--o-reads ref_seqs.qza \
--verbose \
&> 16S_training.log
Train classifier on extracted region:
$ qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ref_seqs.qza \
--i-reference-taxonomy rdp-ref-taxonomy.qza \
--o-classifier rdp_classifier_16S.qza \
--verbose \
&> 16S_classifier.log
This produces an error message regarding memory (see below). Based on these posts here, I added --p-classify--chunk-size XXXX to the command (I tried XXXX = 5000, 2000, 1000, 500, 100, 50 and 10), but it still produced exactly the same error (I ran the logs through a diff checker and they are identical).
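For concreteness, this is roughly what the command looks like with the option added. It is written as a dry run that only prints the command line; the chunk_size and cmd variables are illustrative names, and 1000 is just one of the values I tried:

```shell
# Dry-run sketch: build the training command with the chunk-size option
# and print it. chunk_size and cmd are illustrative names only.
chunk_size=1000  # one of the values tried (5000 down to 10)

cmd="qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ref_seqs.qza \
--i-reference-taxonomy rdp-ref-taxonomy.qza \
--p-classify--chunk-size $chunk_size \
--o-classifier rdp_classifier_16S.qza \
--verbose"

# Print the assembled command; paste the printed line into the shell
# (plus the log redirect) to run it for real.
echo "$cmd"
```

Only the value of chunk_size changed between runs; everything else was identical to the command above.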
Here is the error message
/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2_feature_classifier/classifier.py:102: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.22.1. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
warnings.warn(warning, UserWarning)
Traceback (most recent call last):
File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/commands.py", line 328, in __call__
results = action(**arguments)
File "</homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/decorator.py:decorator-gen-345>", line 2, in fit_classifier_naive_bayes
File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
output_types, provenance)
File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in _callable_executor_
output_views = self._callable(**view_args)
File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2_feature_classifier/classifier.py", line 331, in generic_fitter
pipeline)
File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 32, in fit_pipeline
pipeline.fit(X, y)
File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/pipeline.py", line 350, in fit
Xt, fit_params = self._fit(X, y, **fit_params)
File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/pipeline.py", line 315, in _fit
**fit_params_steps[name])
File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/joblib/memory.py", line 355, in __call__
return self.func(*args, **kwargs)
File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/pipeline.py", line 728, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 809, in fit_transform
return self.fit(X, y).transform(X)
File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 784, in transform
X = self._get_hasher().transform(analyzer(doc) for doc in X)
File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/feature_extraction/_hash.py", line 155, in transform
self.alternate_sign, seed=0)
File "sklearn/feature_extraction/_hashing_fast.pyx", line 83, in sklearn.feature_extraction._hashing_fast.transform
File "<__array_function__ internals>", line 6, in resize
File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 1415, in resize
a = concatenate((a,) * n_copies)
File "<__array_function__ internals>", line 6, in concatenate
MemoryError: Unable to allocate 8.00 GiB for an array with shape (1073741824,) and data type float64
Plugin error from feature-classifier:
Unable to allocate 8.00 GiB for an array with shape (1073741824,) and data type float64
See above for debug info.
I’m unsure why I still get a memory error despite passing the --p-classify--chunk-size XXXX option.
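As a sanity check on the number in the error, the requested array has 1073741824 (2^30) float64 elements at 8 bytes each, which works out to the 8.00 GiB reported. A quick shell calculation (the variable names are mine):

```shell
# Check that the array shape in the MemoryError matches the reported size.
elements=1073741824      # array shape from the MemoryError
bytes_per_element=8      # float64 is 8 bytes
total_bytes=$((elements * bytes_per_element))
gib=$((total_bytes / (1024 * 1024 * 1024)))
echo "$total_bytes bytes = $gib GiB"
```

So, as far as I can tell from the traceback, the failure is a single 8 GiB allocation made inside numpy's resize during the hashing step, and it is the same size regardless of the chunk size I pass.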