Error with alignment with MAFFT

Luna · November 29, 2022, 2:45pm

Hello,

This is my first post here, so I apologize if the format is wrong or if I use the wrong tag.

I am using qiime2 to make a phylogenetic tree. So far, I have converted my fasta file to a qza file with the following code:
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path ./sequences.fasta \
  --output-path ./sequences.qza

I am stuck on the alignment code:
qiime alignment mafft \
  --i-sequences sequences.qza \
  --o-alignment aligned-sequences.listed.qza

This is the error I get:
Plugin error from alignment:

Invalid character in sequence: b'\t'.
Valid characters: ['A', 'B', 'G', 'C', 'M', 'W', 'S', '.', 'V', 'N', 'T', 'R', 'K', 'D', 'H', 'Y', '-']
Note: Use lowercase if your sequence contains lowercase characters not in the sequence's alphabet.

And this is the debug info:
File "/Users/luna/opt/anaconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/q2cli/commands.py", line 339, in __call__
results = action(**arguments)
File "", line 2, in mafft
File "/Users/luna/opt/anaconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
outputs = self.callable_executor(scope, callable_args,
File "/Users/luna/opt/anaconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/qiime2/sdk/action.py", line 391, in callable_executor
output_views = self._callable(**view_args)
File "/Users/luna/opt/anaconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/q2_alignment/_mafft.py", line 128, in mafft
return _mafft(sequences_fp, None, n_threads, parttree, False)
File "/Users/luna/opt/anaconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/q2_alignment/_mafft.py", line 104, in _mafft
msa = skbio.TabularMSA.read(result_fp, format='fasta',
File "/Users/luna/opt/anaconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/skbio/io/registry.py", line 652, in read
return registry.read(file, into=cls, format=format, **kwargs)
File "/Users/luna/opt/anaconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/skbio/io/registry.py", line 513, in read
return self._read_ret(file, format, into, verify, kwargs)
File "/Users/luna/opt/anaconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/skbio/io/registry.py", line 520, in _read_ret
return reader(file, **kwargs)
File "/Users/luna/opt/anaconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/skbio/io/registry.py", line 998, in wrapped_reader
return reader_function(fhs[-1], **kwargs)
File "/Users/luna/opt/anaconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/skbio/io/format/fasta.py", line 748, in _fasta_to_tabular_msa
return TabularMSA(
File "/Users/luna/opt/anaconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/skbio/alignment/_tabular_msa.py", line 785, in __init__
self.extend(sequences, minter=minter, index=index,
File "/Users/luna/opt/anaconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/skbio/alignment/_tabular_msa.py", line 1956, in extend
sequences = list(sequences)
File "/Users/luna/opt/anaconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/skbio/io/registry.py", line 1008, in wrapped_reader
yield from reader_function(fhs[-1], **kwargs)
File "/Users/luna/opt/anaconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/skbio/io/format/fasta.py", line 676, in _fasta_to_generator
yield constructor(seq, metadata={'id': id, 'description': desc},
File "/Users/luna/opt/anaconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/skbio/sequence/_grammared_sequence.py", line 326, in __init__
self._validate()
File "/Users/luna/opt/anaconda3/envs/qiime2-2021.11/lib/python3.8/site-packages/skbio/sequence/_grammared_sequence.py", line 342, in _validate
raise ValueError(
ValueError: Invalid character in sequence: b'\t'.
Valid characters: ['A', 'B', 'G', 'C', 'M', 'W', 'S', '.', 'V', 'N', 'T', 'R', 'K', 'D', 'H', 'Y', '-']
Note: Use lowercase if your sequence contains lowercase characters not in the sequence's alphabet.

I am running qiime2-2021.11 from Visual Studio Code.

Thank you so much in advance for taking the time to read this!!

SoilRotifer · November 29, 2022, 5:15pm

Hi @Luna,

This error message is letting you know that there is an invalid character, i.e. b'\t', in your sequence. Only the characters listed as valid are allowed. How did you prepare the fasta file? Can you share sequences.fasta and sequences.qza with me? If the files are large, you can send me a link via private DM, e.g. through Dropbox or an equivalent service.
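
One way to locate the offending characters is to scan the sequence lines of the FASTA file yourself. The following is only a minimal sketch and not part of QIIME 2: the function name `find_invalid_characters` is hypothetical, the valid set is copied from the error message, and it assumes that lowercase letters are acceptable and that header lines (starting with `>`) do not need checking.

```python
# Valid characters copied from the error message in this thread.
VALID_CHARS = set("ABGCMWS.VNTRKDHY-")


def find_invalid_characters(lines):
    """Return (line_number, character) pairs for characters outside VALID_CHARS.

    Hypothetical helper: header lines are skipped, and lowercase letters
    are compared as uppercase (an assumption, not QIIME 2 behavior).
    """
    problems = []
    for line_number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if text.startswith(">"):
            continue  # header line, not sequence data
        for char in text:
            if char.upper() not in VALID_CHARS:
                problems.append((line_number, char))
    return problems


# Small in-memory example: a tab on line 2 and quotation marks on line 3.
example = [">seq1 example\n", "ACGT\tACGT\n", '"ACGTN"\n']
for line_number, char in find_invalid_characters(example):
    print(f"line {line_number}: {char!r}")
```

To check a real file, pass an open file handle instead of the list, for example `find_invalid_characters(open("sequences.fasta"))`.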

Luna · November 30, 2022, 3:22am

Hello @SoilRotifer ,

Thank you so much for replying!
I made the fasta file in Excel by copying and pasting sequences from the NCBI nucleotide library. I actually had trouble converting it from fasta to qza because of formatting issues, such as quotation marks (") in some lines that never showed up in Excel but did in a text editor. Could this be related? Does '\t' stand for a tab?

sequences_1.qza (11.9 KB)
This is the qza file, but I can't seem to upload fasta files here... I will try to make a Dropbox link.

Luna · November 30, 2022, 3:22am

I just figured it out! It turns out that the '\t' were actually tabs and I found them using the "Show Invisibles" option on CotEditor. Then I could go in and remove them and the code ran with no problems.

SoilRotifer · November 30, 2022, 2:31pm

This is what I was expecting to find. Glad you figured it out!

In the future, I recommend using a plain-text editor, like the one you used (CotEditor), or others like BBEdit or Notepad, rather than spreadsheet or word-processing tools, to view, edit, and build your files. As you can see, those tools often introduce hidden whitespace and other characters into your data.
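
If a file has already been through a spreadsheet, the stray characters can also be stripped with a short script rather than by hand. This is a rough sketch under stated assumptions, not a QIIME 2 feature: `clean_fasta_lines` is a hypothetical helper, and it assumes the only damage is the kind described in this thread (tabs, wrapping quotation marks, and blank lines).

```python
def clean_fasta_lines(lines):
    """Return FASTA lines with tabs, wrapping quotes, and blank lines removed.

    Hypothetical helper for spreadsheet-damaged files; it does not
    validate the sequences themselves.
    """
    cleaned = []
    for line in lines:
        # Drop line endings and surrounding whitespace, then wrapping quotes.
        text = line.strip().strip('"')
        if not text:
            continue  # skip lines that held only whitespace or quotes
        if text.startswith(">"):
            # Header: keep the description, but turn tabs into spaces.
            cleaned.append(text.replace("\t", " ").strip())
        else:
            # Sequence: remove all internal whitespace and leftover quotes.
            cleaned.append("".join(text.split()).replace('"', ""))
    return cleaned


# Small in-memory example of spreadsheet-style damage.
damaged = ['">seq1\tdesc"\t\n', "ACGT\tAC\n", '"GGTT"\n', "\t\n"]
print("\n".join(clean_fasta_lines(damaged)))
```

The cleaned lines can then be written to a new file, one per line, and imported again with `qiime tools import`. Keeping the original file untouched makes it easy to compare the two if something still looks wrong.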

Good luck!
