Plugin error from feature-classifier: Found non-header line when attempting to read the 1st record:

Brightbeard · July 2, 2020, 8:15pm

Hi, all,

I'm getting a funny error using the naive Bayes classifier on the eukaryote UNITE database and I can't quite figure out a workaround. There doesn't seem to be anything on the forums like this, either. I'm running the QIIME 2 container v. 2019.7 (I know, it's old...) on Docker and have had no trouble when using the classifier on the fungal-only database. Furthermore, when I compare both databases (fasta and taxa files) to one another, they look exactly the same.

The import commands for the sequences and taxonomy files produce the correct .qza files. The issue is with this command:

qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads //c/Sequencing/Spike_R_2/UNITE_train_set_oom.qza --i-reference-taxonomy //c/Sequencing/Spike_R_2/UNITE_taxa_oom.qza --o-classifier //c/Sequencing/Spike_R_2/UNITE_classifier_BW_oom.qza --verbose

This results in:

/opt/conda/envs/qiime2-2019.7/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/qiime2-archive-75mbqgx5/b9c2c328-b05f-47cd-b763-b13169c1657d/data/dna-sequences.fasta'> does not look like a fasta file
% (file, fmt), FormatIdentificationWarning)
Traceback (most recent call last):
File "/opt/conda/envs/qiime2-2019.7/lib/python3.6/site-packages/q2cli/commands.py", line 327, in __call__
results = action(**arguments)
File "</opt/conda/envs/qiime2-2019.7/lib/python3.6/site-packages/decorator.py:decorator-gen-349>", line 2, in fit_classifier_naive_bayes
File "/opt/conda/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/sdk/action.py", line 229, in bound_callable
spec.view_type, recorder)
File "/opt/conda/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/sdk/result.py", line 289, in _view
result = transformation(self._archiver.data_dir)
File "/opt/conda/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/core/transform.py", line 70, in transformation
new_view = transformer(view)
File "/opt/conda/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/core/transform.py", line 213, in wrapped
return transformer(view.file.view(self._wrapped_view_type))
File "/opt/conda/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_types/feature_data/_transformer.py", line 264, in _9
generator = _read_dna_fasta(str(ff))
File "/opt/conda/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_types/feature_data/_transformer.py", line 240, in _read_dna_fasta
return skbio.read(path, format='fasta', constructor=skbio.DNA)
File "/opt/conda/envs/qiime2-2019.7/lib/python3.6/site-packages/skbio/io/registry.py", line 1161, in read
**kwargs)
File "/opt/conda/envs/qiime2-2019.7/lib/python3.6/site-packages/skbio/io/registry.py", line 506, in read
return (x for x in itertools.chain([next(gen)], gen))
File "/opt/conda/envs/qiime2-2019.7/lib/python3.6/site-packages/skbio/io/registry.py", line 531, in _read_gen
yield from reader(file, **kwargs)
File "/opt/conda/envs/qiime2-2019.7/lib/python3.6/site-packages/skbio/io/registry.py", line 1008, in wrapped_reader
yield from reader_function(fhs[-1], **kwargs)
File "/opt/conda/envs/qiime2-2019.7/lib/python3.6/site-packages/skbio/io/format/fasta.py", line 675, in _fasta_to_generator
FASTAFormatError):
File "/opt/conda/envs/qiime2-2019.7/lib/python3.6/site-packages/skbio/io/format/fasta.py", line 849, in _parse_fasta_raw
"\n%s" % seq_header)
skbio.io._exception.FASTAFormatError: Found non-header line when attempting to read the 1st record:
>SH1140862.08FU_HM100661_reps_singleton

Plugin error from feature-classifier:

Found non-header line when attempting to read the 1st record:
>SH1140862.08FU_HM100661_reps_singleton

The "does not look like a fasta file" warning is frustrating, because the file looks exactly like the other fasta file that works. Additionally, my first record is:

SH1140862.08FU_HM100661_reps_singleton
CTGAGCTGTCGACACGAGCTGTTGCTGGTCCTCAAACAAGGGGGCATGTGCACGCTCTGTTCACACATCTACTCACAGGTGCACCGTCTGTAGTTTTATGGTCTGGGGGACACACCGTCTTCCTCCCGTGGCTCTACGTCTTTACACACACATCGTAGTTAAGTTTTATGGAATGTGCATCGCTTTTAACGTAATACAATACAACTTTCAGCAACGGATCTCTTGGCTCTCGCATCGATGAAGAACGCAGCGAAATGCGATAAGTTATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACCTTGCGCCCCTTGGCTATTCCGAGGGGCATGCCTGTTTGAGTATCATGAACACCTCAACTCCTCATGTTTCCCGTGATGAGCTTGGACTTCTGGAGGTTTTGCTTACCTGCGGTCTCTCCTCTCAAACGCATCAGCTTGCCAGTGTTTGGTGGCATCACTGGTGAGATAACTATCTATGCTCGTGGCCGTCTGCCAGATAACCTTCAGCGATGGAGGTTTGCTTGAGCTCACAAAGGTCTTTCCACAGCCAAGACTGCTTTTTTAACTTTCGATCTCAAATCCCGTAGGACACCCGCTGAACCGTAGCTGACTAGCGCGCCTAA

This is similar to the first record of the fasta file that works, down to the same delimiter:

SH1140860.08FU_HF674537_reps_singleton
CATTACCGAATTGTCGACACGAGTTGTTGCTGGTCCCCAAACGGGGGCACGTGCACGCTCTGTTTGTACATCCATTCACACCTGTGCACCCCATGTAGTTCTGTGGTTTGGGGGACTCTGTCCTCTCGCCGTGGTTCTATATCTTTACACACGCTCTGTAATAAAGTCTCATGGAATGTATGCAGCGTTTAACGCAATACAATACAACTTTCAGCAACGGATCTCTTGGCTCTCGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACCTTGCGCCCTTTGGCTATTCCGAAGGGCATGCCTGTTTGAGTATCATGAACACCTCAACTCTCATGGTTCGCCGTGATGAGCTTGGACTTTGGGGGTCTTGCTGGCCTGCGGTCGGCTCCCTTCAAATGAATCAGCTTTCCAGTGTTTGGTGGCATCACGGGTGTGATAAATATCTACGCTTGTGGTTTCCGGAGGATCATTTCCGAATTGGTGGCACGAAGTGGTGGTTGGTCCCAAACGGGGGCAAGTGCCCGGTTTGGTTGTACCATCCATTACCCCTTGGCACCCCNAGGAGGTTTGGGGGTTGGGGGGATTCGTTCTTTTGCCGGGGTTTTATATTTTTACCCCCGGTTTGTAATAAAATTTCCAGGAAAGGAAGCAGGGTTTAAAGCCATTCCATTCCAATTTTAGCAAAGGATTTTTTGGGTTTTGGCTTGGAGAAGGAAGCAAGGAAAATGGGTAAGTAAAGGGAAATGCCGAAATCAAGGAATTCTTGGATTTTTGAACGCCCCCTGGGCCCCTTGGGTTTTTCGAAGGGCCAGCCTGTTTGAGGATTCAGAACCCCTTAAATTTCCAGGTTTGCCGGGGGGAGGCTGGGACTTGGGGGTTCTGGTGGCCTGCGGTTGGCTCCCTTCAAAAGAATTCACTTTCCCAGGTTTGGGGGCCTCCCGGGGGGGAAAAAAATTNACGGCTGGGGGTTTCCGCCAGGTAACCTTCAGTGATGGAGGTTCGCTGGGGCTCATAAATGTCTCTCCTCAGCGAAGACAG

Does anyone have any thoughts about what I can do to fix this? I'm going to continue to play with the fasta file, since I had to reformat it from the version downloaded from UNITE.

Cheers,

Oddant1 · July 2, 2020, 8:24pm

Hello @Brightbeard. Maybe I'm missing something here, but I'm confused as to how that second fasta file actually does work. Fasta headers must start with >, and in your error message it looks like your header starts with a >, but in the snippets of the fasta files you posted, neither of the headers starts with a >. Does the header in the file that isn't working start with >? Does the header in the file that is working start with >?

Brightbeard · July 2, 2020, 8:25pm

They both do. It appears > on a new line triggers some formatting I was unaware of!

Oddant1 · July 2, 2020, 8:37pm

Ok cool, that makes sense. I went looking through the skbio codebase, and it looks like that error is only raised if your first line (ignoring any blank lines) does not start with a >... but yours does. Can you PM me the .qza files you used in the run that errored so I can look at them more closely?
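
Editor's illustration: the check described above can be sketched in a few lines of Python. This is a simplified stand-in written for this thread, not scikit-bio's actual code; the function name `first_record_has_header` is hypothetical. It shows how an invisible character in front of the > would make an otherwise correct header fail such a check, assuming the file's bytes are decoded so that the stray character ends up at the start of the first line.

```python
# Illustrative only: a simplified "first non-blank line must start with '>'"
# check. This is NOT scikit-bio's implementation.

def first_record_has_header(text: str) -> bool:
    """Return True if the first non-blank line of `text` starts with '>'."""
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            return line.startswith(">")
    return False  # no non-blank lines at all


clean = ">SH1140862.08FU_HM100661_reps_singleton\nCTGAGCTGTC\n"
# U+FEFF is the byte order mark character; most editors do not display it.
with_bom = "\ufeff" + clean

print(first_record_has_header(clean))     # True
print(first_record_has_header(with_bom))  # False
```

Both strings look identical when printed, which matches the symptom in the error message: the reported line visibly starts with >, yet the check rejects it.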

Oddant1 · July 2, 2020, 9:18pm

Thank you for PMing me the data. As I suspected, you have 3 bytes of data sitting in front of the > on the first line of your fasta file. These bytes form what is called a Byte Order Mark, or BOM. This data is visible when the file is opened with a hex editor (highlighted in the screenshot attached to the original post).
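
Editor's illustration: without a hex editor, the same check can be done by reading the first bytes of the file in binary mode. The sketch below assumes the three bytes are the UTF-8 BOM (EF BB BF), which is the only standard 3-byte BOM but is not spelled out in the thread. The helper name `has_utf8_bom` and the demo file names are hypothetical.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path: str) -> bool:
    """Return True if the file at `path` begins with the UTF-8 BOM."""
    # Binary mode: no decoding happens, so we see the raw bytes.
    with open(path, "rb") as handle:
        return handle.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


# Demo on two throwaway files (made-up content, not the real UNITE data).
tmp_dir = tempfile.mkdtemp()
bom_path = os.path.join(tmp_dir, "with_bom.fasta")
clean_path = os.path.join(tmp_dir, "clean.fasta")
record = b">SH1140862.08FU_HM100661_reps_singleton\nCTGAGCTGTC\n"

with open(bom_path, "wb") as fh:
    fh.write(codecs.BOM_UTF8 + record)  # b"\xef\xbb\xbf" followed by the record
with open(clean_path, "wb") as fh:
    fh.write(record)

print(has_utf8_bom(bom_path))    # True
print(has_utf8_bom(clean_path))  # False
```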

This is almost certainly what is confusing the validation of your file. Fortunately, Byte Order Marks aren't very useful in most contexts (at least not for our purposes here), and you should be able to remove it without harming your data in any way. There are a few ways to get rid of this byte order mark, but before I suggest some, can you tell me what kind of environment you're running QIIME 2 in (what operating system, is it a native install or a VM, is it an HPC, etc.)? Additionally, if you already know how to handle this issue without further assistance, feel free to just take care of it.
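
Editor's illustration: one possible way to remove the mark, sketched in Python. It assumes a UTF-8 BOM (bytes EF BB BF) and copies the file to a new path rather than editing it in place, so the original stays untouched. The function name `strip_utf8_bom` is hypothetical and is not part of QIIME 2 or scikit-bio.

```python
import codecs
import shutil


def strip_utf8_bom(src: str, dst: str) -> bool:
    """Copy `src` to `dst`, dropping a leading UTF-8 BOM if present.

    Returns True if a BOM was found and removed, False otherwise.
    """
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        head = fin.read(len(codecs.BOM_UTF8))
        had_bom = head == codecs.BOM_UTF8
        if not had_bom:
            fout.write(head)  # not a BOM, so these bytes are real data
        # Stream the remainder so large reference files are not loaded
        # into memory all at once.
        shutil.copyfileobj(fin, fout)
    return had_bom
```

The cleaned FASTA file would then need to be imported into a new .qza before training the classifier again, since the artifact built from the old file still contains the BOM.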

Brightbeard · July 2, 2020, 9:26pm

Interesting! I wonder how that got there? I actually just deleted and replaced the ">" in my first line and the classifier is off and running now. I didn't expect it to work, since it was there before, but I must have deleted those three bytes without even realizing that was the problem. I'm on a Windows 10 OS, running QIIME 2 through a container on Docker.

The fasta file I downloaded had duplicate entries for each sequence (and many errors where base pairs weren't capitalized... weird for UNITE), so I used R to remove those duplicates. R must have added those bytes when I wrote out the edited file.

Oddant1 · July 2, 2020, 9:28pm

Yeah, there is a decent chance that R inserted those bytes somewhere along the way, depending on what text encoding you chose to save the file as (computers can encode text in many different, complicated, and sometimes only semi-compatible ways). Glad you were able to get it working!
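
Editor's illustration: the effect of the encoding choice can be seen in Python, used here only as a stand-in since the thread does not show the R call that wrote the file. Python's `utf-8` codec writes the text as-is, while its `utf-8-sig` codec prepends the three BOM bytes. Other tools offer similar paired options under different names.

```python
header = ">SH1140862.08FU_HM100661_reps_singleton\n"

plain = header.encode("utf-8")         # no byte order mark
with_sig = header.encode("utf-8-sig")  # UTF-8 with a leading BOM

print(plain[:1])                   # b'>'
print(with_sig[:4])                # b'\xef\xbb\xbf>'
print(len(with_sig) - len(plain))  # 3

# Decoding the BOM-prefixed bytes as plain UTF-8 keeps the mark as an
# invisible U+FEFF character in front of the '>'.
print(with_sig.decode("utf-8").startswith(">"))      # False
print(with_sig.decode("utf-8-sig").startswith(">"))  # True
```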

Brightbeard · July 2, 2020, 9:31pm

Same here. I'm not well-versed in these matters, so I appreciate the help.
