RDP Reference Database in QIIME2 format

Matt · June 24, 2020, 10:22pm

I thought I'd give an update on my 'progress' with this. Importing taxonomy and extracting reads has worked fine, however now I've run into a (potentially unrelated) issue with feature-classifier fit-classifier-naive-bayes, which I'm trying to work out. I'm basing my workflow off this tutorial.

Import RDP data into qza format

# rdp taxonomy strings
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path rdp_qiime_taxonomy2.txt \
--output-path rdp-ref-taxonomy.qza

# download and import KMaki's fasta file of RDP sequences
$ wget --no-check-certificate https://ndownloader.figshare.com/files/23146310?private_link=86d9b343729e4c67ea08 -O rep_set_99_rdp.fa

qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path rep_set_99_rdp.fa \
--output-path rdp_16S_otus.qza

Extract reads from the relevant region (V3-V5) out of the ref database

$ qiime feature-classifier extract-reads \
--i-sequences rdp_16S_otus.qza \
--p-f-primer CCTACGGGNGGCWGCAG  \
--p-r-primer GACTACHVGGGTATCTAATCC   \
--p-min-length 300 \
--p-max-length 600 \
--o-reads ref_seqs.qza \
--verbose \
&> 16S_training.log

Train classifier on extracted region:

$ qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ref_seqs.qza \
--i-reference-taxonomy rdp-ref-taxonomy.qza \
--o-classifier rdp_classifier_16S.qza \
--verbose \
&> 16S_classifier.log

This produces an error message regarding memory (see below). Based on these posts here, I added --p-classify–-chunk-size XXXX to the command (tried XXXX = 5000, 2000, 1000, 500, 100, 50 and 10), however it still produced the exact same error (I ran the logs through a diffchecker and they are indentical)

Here is the error message

/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2_feature_classifier/classifier.py:102: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.22.1. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
  warnings.warn(warning, UserWarning)
Traceback (most recent call last):
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/commands.py", line 328, in __call__
    results = action(**arguments)
  File "</homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/decorator.py:decorator-gen-345>", line 2, in fit_classifier_naive_bayes
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
    output_types, provenance)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2_feature_classifier/classifier.py", line 331, in generic_fitter
    pipeline)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 32, in fit_pipeline
    pipeline.fit(X, y)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/pipeline.py", line 350, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/pipeline.py", line 315, in _fit
    **fit_params_steps[name])
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/joblib/memory.py", line 355, in __call__
    return self.func(*args, **kwargs)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/pipeline.py", line 728, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 809, in fit_transform
    return self.fit(X, y).transform(X)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 784, in transform
    X = self._get_hasher().transform(analyzer(doc) for doc in X)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/feature_extraction/_hash.py", line 155, in transform
    self.alternate_sign, seed=0)
  File "sklearn/feature_extraction/_hashing_fast.pyx", line 83, in sklearn.feature_extraction._hashing_fast.transform
  File "<__array_function__ internals>", line 6, in resize
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 1415, in resize
    a = concatenate((a,) * n_copies)
  File "<__array_function__ internals>", line 6, in concatenate
MemoryError: Unable to allocate 8.00 GiB for an array with shape (1073741824,) and data type float64

Plugin error from feature-classifier:

  Unable to allocate 8.00 GiB for an array with shape (1073741824,) and data type float64

See above for debug info.

I'm unsure why I still get a memory error despite giving the --p-classify–-chunk-size XXXX command.