Plugin error from feature-classifier: Invalid character in sequence: b'X'.

UGG · May 30, 2019, 1:50pm

Hi,
I want to train a classifier from the PhytoREF database. I have my reference sequences file: phytoref_ref-seqs.qza (946.4 KB)

and the taxonomy.qza file: phytoref_taxonomy.qza (176.5 KB)

However, when I run the command below:
qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads phytoref_ref-seqs.qza --i-reference-taxonomy phytoref_taxonomy.qza --o-classifier phytoref-classifier.qza

I got this error:
Plugin error from feature-classifier:

Invalid character in sequence: b'X'.
Valid characters: ['G', 'W', 'R', 'M', 'K', 'B', 'V', 'A', '-', '.', 'D', 'C', 'T', 'H', 'S', 'N', 'Y']
Note: Use lowercase if your sequence contains lowercase characters not in the sequence's alphabet.

Debug info has been saved to /tmp/qiime2-q2cli-err-y4eqdzmc.log

Here is the log file:
Traceback (most recent call last):
File "/home/ugg/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/q2cli/commands.py", line 311, in __call__
results = action(**arguments)
File "</home/ugg/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/decorator.py:decorator-gen-349>", line 2, in fit_classifier_naive_bayes
File "/home/ugg/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/qiime2/sdk/action.py", line 231, in bound_callable
output_types, provenance)
File "/home/ugg/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/qiime2/sdk/action.py", line 365, in callable_executor
output_views = self._callable(**view_args)
File "/home/ugg/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/q2_feature_classifier/classifier.py", line 318, in generic_fitter
pipeline)
File "/home/ugg/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 29, in fit_pipeline
seq_ids, X = _extract_reads(reads)
File "/home/ugg/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 37, in _extract_reads
return zip(*[(r.metadata['id'], r._string) for r in reads])
File "/home/ugg/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 37, in <listcomp>
return zip(*[(r.metadata['id'], r._string) for r in reads])
File "/home/ugg/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/q2_types/feature_data/_transformer.py", line 228, in __iter__
yield from self.generator
File "/home/ugg/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/skbio/io/registry.py", line 506, in <genexpr>
return (x for x in itertools.chain([next(gen)], gen))
File "/home/ugg/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/skbio/io/registry.py", line 531, in _read_gen
yield from reader(file, **kwargs)
File "/home/ugg/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/skbio/io/registry.py", line 1008, in wrapped_reader
yield from reader_function(fhs[-1], **kwargs)
File "/home/ugg/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/skbio/io/format/fasta.py", line 677, in _fasta_to_generator
**kwargs)
File "/home/ugg/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/skbio/sequence/_grammared_sequence.py", line 326, in __init__
self._validate()
File "/home/ugg/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/skbio/sequence/_grammared_sequence.py", line 350, in _validate
list(self.alphabet)))
ValueError: Invalid character in sequence: b'X'.
Valid characters: ['G', 'W', 'R', 'M', 'K', 'B', 'V', 'A', '-', '.', 'D', 'C', 'T', 'H', 'S', 'N', 'Y']
Note: Use lowercase if your sequence contains lowercase characters not in the sequence's alphabet.

I found what looks like the same problem in this topic: Plugin Error: feature-classifier classify-sklearn
but the solution suggested there did not solve my problem; I am still getting the error above.

Could you please help me understand what mistake I am making, if there is one?
Thanks.

Nicholas_Bokulich · May 30, 2019, 2:31pm

You have "X" at least once in your sequences. That's not a valid nucleotide base! You will need to find and replace/remove this.

That topic describes a similar error, but not the same error. That user had lowercase characters in their sequences, so the same solution will not work for you.

I'd start by seeing how many "X" characters are in your sequences. Then you can either just find/replace or figure out if an automated solution is better for you.
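One way to do that count is sketched below, assuming the reference sequences have first been exported to a plain FASTA file (for example with `qiime tools export`, which for this artifact type should produce a `dna-sequences.fasta` file). The function names `parse_fasta`, `find_invalid`, and `replace_invalid` are hypothetical helpers, not part of QIIME 2 or scikit-bio, and the valid alphabet is copied from the error message above. Replacing with `N` is only one option; removing the affected records may suit your data better.

```python
from collections import Counter

# Alphabet copied from the error message above (IUPAC DNA codes plus gap characters).
# Lowercase letters are deliberately treated as invalid, matching the validator's note.
VALID_CHARS = set("ACGTRYSWKMBDHVN-.")


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_invalid(lines):
    """Map each header to a Counter of sequence characters outside VALID_CHARS."""
    report = {}
    for header, seq in parse_fasta(lines):
        bad = Counter(ch for ch in seq if ch not in VALID_CHARS)
        if bad:
            report[header] = bad
    return report


def replace_invalid(seq, placeholder="N"):
    """Swap every character outside VALID_CHARS for the placeholder."""
    return "".join(ch if ch in VALID_CHARS else placeholder for ch in seq)


# Toy input; with a real file, pass an open file handle instead of a list.
example = [">seq1 Xanthophyceae", "ACGTXACGT", "NNXX", ">seq2", "ACGT"]
for header, bad in find_invalid(example).items():
    print(header, dict(bad))
print(replace_invalid("ACGTXACGT"))
```

Only sequence lines are inspected here, so an "X" in a header (as in `seq1 Xanthophyceae`) is not counted.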

UGG · May 30, 2019, 6:28pm

Thanks a lot, it is my bad for not seeing the X's in the header lines. When I remove them, it works.

Nicholas_Bokulich · May 30, 2019, 6:33pm

The header lines should not be the problem... that error indicates that the Xs are in the seqs.
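To check where the Xs actually sit, a quick sketch like the following counts the character separately in header lines and in sequence lines of an exported FASTA file. The function name `count_char_by_line_type` is a hypothetical helper, and the example records are made up for illustration.

```python
def count_char_by_line_type(lines, char="X"):
    """Return (header_count, sequence_count) for `char` across FASTA lines."""
    in_headers = in_sequences = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Header line: free text, not checked against the DNA alphabet.
            in_headers += line.count(char)
        else:
            # Sequence line: this is where the traceback's _validate step applies.
            in_sequences += line.count(char)
    return in_headers, in_sequences


# Toy input; with a real file, pass an open file handle instead of a list.
example = [">seq1 Xanthophyceae", "ACGTXACGT", ">seq2", "ACGT"]
headers, sequences = count_char_by_line_type(example)
print(f"X in headers: {headers}, X in sequences: {sequences}")
```

A non-zero second number would confirm that the invalid characters are in the sequences themselves.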
