Error from feature-classifier: indices and data should have the same size

Hi, I am trying to build a classifier for the complete RDP database in the QIIME VM (QIIME 2 Core 2018.6):

I downloaded the Unaligned Bacteria 16S FASTA file from here (3.8 GB).

To start the workflow, I need the database split into a taxonomy file and an OTU file, which I do with a Python script. It separates the single RDP file into two (OTU file, 3.2 GB; taxonomy file, 333 MB) in the required format:

Example taxonomy line:

494589	k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Acidimicrobiales; f__Acidimicrobiaceae; g__Acidimicrobium; s__

Corresponding OTU entry:

>494589
GCGGCGTGCTACACATGCAGTCGTACGCGGTGGCACACCGAGTGGCGAACGGGTGCGTAAC....
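For reference, a splitting script along these lines could look roughly as follows. This is a hypothetical sketch, not the actual script: it assumes a combined FASTA whose header lines carry the ID, a tab, and a Greengenes-style lineage string, and the function name `split_reference` is made up for illustration.

```python
def split_reference(combined_path, otu_path, taxo_path):
    """Split a combined reference FASTA into a taxonomy TSV and a plain FASTA.

    Assumes header lines of the form '>ID<TAB>k__...; p__...; ...'.
    """
    with open(combined_path) as src, \
         open(otu_path, "w") as otus, \
         open(taxo_path, "w") as taxa:
        for line in src:
            if line.startswith(">"):
                # Header line: peel off the ID, send the lineage to the TSV.
                seq_id, _, lineage = line[1:].rstrip("\n").partition("\t")
                taxa.write(f"{seq_id}\t{lineage}\n")
                otus.write(f">{seq_id}\n")
            else:
                # Sequence line: copy through unchanged.
                otus.write(line)
```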

Importing these files into QIIME artifacts works just fine, so I can run:

qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads rdp_otus.qza --i-reference-taxonomy rdp_taxa.qza --o-classifier rdp_classifier.qza

After a while I get the following error (on screen and in log file):
indices and data should have the same size

To make sure I didn't mess anything up in my conversion step, I checked that the IDs are identical in both files.
I also tried the same command on subsets of the files (for example one third, or 5/6 …), and those runs completed without problems.
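An ID check like the one described above can be scripted with a couple of helpers (file-name arguments are placeholders; these functions are an illustration, not the poster's code):

```python
def read_fasta_ids(path):
    """IDs from '>' header lines: first whitespace-delimited token."""
    with open(path) as fh:
        return {line[1:].split()[0] for line in fh if line.startswith(">")}

def read_taxonomy_ids(path):
    """IDs from the first tab-separated column of a taxonomy TSV."""
    with open(path) as fh:
        return {line.split("\t", 1)[0] for line in fh if line.strip()}
```

Any non-empty symmetric difference between `read_fasta_ids("rdp_otus.fasta")` and `read_taxonomy_ids("rdp_taxa.tsv")` would point to a problem in the conversion step.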

It seems to me that the amount of data is getting too big, but there is no memory error (my RAM seems to be sufficient?).

Has anyone seen this kind of error message before?
What am I doing wrong?
Is there some limit on the maximum file size or the number of sequences?

Thank you for helping me


Thanks for describing your steps to debug!

Could you please report the full error message from the log file?

No, there is no such limit — and if there were, you would see a different error message.

Thanks for checking — this sounds more like what the error implies (but I’d need to see the full error message to be sure)

So there may be a formatting error or special character or something that slipped into one part of the file. Do all chunks work?


Full error message:

home/qiime2/miniconda/envs/qiime2-2018.6/lib/python3.5/site-packages/q2_feature_classifier/classifier.py:101: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.19.1. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
  warnings.warn(warning, UserWarning)
Traceback (most recent call last):
  File "/home/qiime2/miniconda/envs/qiime2-2018.6/lib/python3.5/site-packages/q2cli/commands.py", line 274, in __call__
    results = action(**arguments)
  File "<decorator-gen-294>", line 2, in fit_classifier_naive_bayes
  File "/home/qiime2/miniconda/envs/qiime2-2018.6/lib/python3.5/site-packages/qiime2/sdk/action.py", line 232, in bound_callable
    output_types, provenance)
  File "/home/qiime2/miniconda/envs/qiime2-2018.6/lib/python3.5/site-packages/qiime2/sdk/action.py", line 367, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/home/qiime2/miniconda/envs/qiime2-2018.6/lib/python3.5/site-packages/q2_feature_classifier/classifier.py", line 316, in generic_fitter
    pipeline)
  File "/home/qiime2/miniconda/envs/qiime2-2018.6/lib/python3.5/site-packages/q2_feature_classifier/_skl.py", line 32, in fit_pipeline
    pipeline.fit(X, y)
  File "/home/qiime2/miniconda/envs/qiime2-2018.6/lib/python3.5/site-packages/sklearn/pipeline.py", line 248, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/qiime2/miniconda/envs/qiime2-2018.6/lib/python3.5/site-packages/sklearn/pipeline.py", line 213, in _fit
    **fit_params_steps[name])
  File "/home/qiime2/miniconda/envs/qiime2-2018.6/lib/python3.5/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "/home/qiime2/miniconda/envs/qiime2-2018.6/lib/python3.5/site-packages/sklearn/pipeline.py", line 581, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/qiime2/miniconda/envs/qiime2-2018.6/lib/python3.5/site-packages/sklearn/base.py", line 520, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/home/qiime2/miniconda/envs/qiime2-2018.6/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 519, in transform
    X = self._get_hasher().transform(analyzer(doc) for doc in X)
  File "/home/qiime2/miniconda/envs/qiime2-2018.6/lib/python3.5/site-packages/sklearn/feature_extraction/hashing.py", line 167, in transform
    shape=(n_samples, self.n_features))
  File "/home/qiime2/miniconda/envs/qiime2-2018.6/lib/python3.5/site-packages/scipy/sparse/compressed.py", line 98, in __init__
    self.check_format(full_check=False)
  File "/home/qiime2/miniconda/envs/qiime2-2018.6/lib/python3.5/site-packages/scipy/sparse/compressed.py", line 167, in check_format
    raise ValueError("indices and data should have the same size")
ValueError: indices and data should have the same size

Yes, every chunk and chunk combination I tried worked.


You are correct — this is related to the number of sequences, but not due to memory constraints.

@BenKaehler reported this bug to scikit-learn — did you figure out a way to address this in q2-feature-classifier @BenKaehler?

If you have not done so already, I would recommend dereplicating your sequences to reduce the number of unique sequences, rather than training on the full sequence set.
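Exact dereplication just collapses identical sequences to a single representative. A minimal sketch of the idea (in QIIME 2 this is normally done on artifacts, e.g. via the q2-vsearch plugin, rather than by hand):

```python
def dereplicate(records):
    """Yield (seq_id, sequence) pairs, keeping the first ID seen
    for each unique sequence and dropping exact duplicates."""
    seen = set()
    for seq_id, seq in records:
        if seq not in seen:
            seen.add(seq)
            yield seq_id, seq
```

This shrinks the training set without discarding any distinct sequence, which is exactly what helps when the problem scales with the number of sequences.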

Otherwise, let’s see what Ben has to say!

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.