Better location information for "'utf-8' codec can't decode..." error

nick-youngblut · December 6, 2017, 2:00pm

My metadata table seems to contain at least one non-ASCII character, so when I try to use it for commands such as qiime taxa barplot, I get the following error:

There was an issue with loading the file /path/to/my/file/metadata.txt as metadata:

  'utf-8' codec can't decode byte 0xc4 in position 7724: invalid continuation byte

My metadata table is 3289 rows x 80 columns, so I'm having a hard time finding "position 7724". Does that mean the 7724th value in the metadata table (row-wise?), the 7724th character, or something else? "position 7724" can't be row or column.

It would be helpful if this error reported the row and column instead of the "position".

jairideout · December 6, 2017, 10:55pm

Hi @nick-youngblut! This error message is being raised by Python when attempting to read the TSV file. I agree it's not the most user-friendly error message. It states the value of the byte that can't be decoded (0xc4) and where that byte was found. I think this would be the 7725th byte in your file, since "position" looks like a 0-based byte offset rather than a 1-based one. It is not a row, column, or character count.
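
To turn a byte offset like that into something easier to find, a short Python sketch along these lines can map it to a line and a tab-separated column. This is only an illustration: `locate_byte` and `find_first_bad_byte` are hypothetical helpers, not part of QIIME 2, and they assume the reported position is an offset from the start of the file (a reader that decodes in chunks could report it relative to a chunk instead).

```python
def locate_byte(data: bytes, position: int):
    """Map a 0-based byte offset to a 1-based (line, column) pair.

    Hypothetical helper for illustration; assumes "\n" line endings
    and tab-separated fields.
    """
    if not 0 <= position < len(data):
        raise ValueError("position is outside the file")
    # Line number = newlines seen before the offset, plus one.
    line = data.count(b"\n", 0, position) + 1
    # rfind returns -1 on the first line, so +1 gives offset 0.
    line_start = data.rfind(b"\n", 0, position) + 1
    # Column number = tabs seen on this line before the offset, plus one.
    column = data.count(b"\t", line_start, position) + 1
    return line, column


def find_first_bad_byte(data: bytes):
    """Return (line, column, byte value) of the first undecodable byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the 0-based "position" shown in the error message.
        line, column = locate_byte(data, exc.start)
        return line, column, data[exc.start]
    return None
```

Reading the file in binary mode (`open(path, "rb").read()`) and passing the bytes to `find_first_bad_byte` would then give the row and column of the first problem byte.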

You can check what encoding your file has by using one of the Unix tools described here. That may shed some light on the issue -- it looks like you have a Unicode file encoded with something other than utf-8. Re-encoding to utf-8 may solve the issue.
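
If the original encoding is known or can be guessed, re-encoding takes only a few lines of Python. The sketch below is a minimal illustration; `reencode_to_utf8` is a hypothetical helper, and `source_encoding` is something you have to supply yourself, since the error message does not reveal it.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding):
    """Rewrite a text file as UTF-8, decoding it with source_encoding."""
    raw = Path(src).read_bytes()
    # A wrong guess may raise UnicodeDecodeError here. Single-byte encodings
    # such as latin-1 accept any byte, though, so a wrong guess can also
    # silently produce the wrong characters: check the output by eye.
    text = raw.decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

Writing to a separate output path keeps the original file untouched in case the guess turns out to be wrong.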

I'm working on overhauling Metadata in qiime2 for this month's release (2017.12), and part of the work will include better error messages for these types of situations. Would you mind sharing your metadata file with me so that I can make sure this situation is handled better? Feel free to send me a DM if you don't want to share the file publicly. Thanks!

nick-youngblut · December 7, 2017, 8:07am

Thanks for the explanation! After a bit of trial and error (removing columns from the metadata table and seeing if the table would load), I was able to find the value that caused the error: "start of box 12.17ƒ had to use 2nd aliquot b/c 1st was too small". Removing the "ƒ" fixed the issue. If you add this character to any metadata table, you should be able to reproduce my error.
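
For anyone who wants to reproduce the message without the file, here is a small Python sketch. It assumes the value was saved in an encoding where "ƒ" is the single byte 0xc4, which is the case for Mac OS Roman; that is a guess, not something confirmed in this thread.

```python
# Assumption: the text was saved as Mac OS Roman, where "ƒ" is the byte 0xc4.
value = "start of box 12.17ƒ had to use 2nd aliquot b/c 1st was too small"
raw = value.encode("mac_roman")

try:
    raw.decode("utf-8")
    message = None
except UnicodeDecodeError as exc:
    # In UTF-8, 0xc4 starts a two-byte sequence, but the next byte here is a
    # plain space, so the decoder reports an invalid continuation byte.
    message = str(exc)

print(message)
# 'utf-8' codec can't decode byte 0xc4 in position 18: invalid continuation byte
```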

jairideout · December 7, 2017, 8:16pm

Thanks for the details @nick-youngblut, I will definitely test that out! I'm specifically curious about the encoding of the metadata file that you exported -- the file contains a non-ASCII character (“ƒ”), but that character can be encoded in the file a number of ways. Would you be able to run the following command on your metadata TSV file that contained the non-ASCII character, and send me the output? That would be helpful!

If you're on Linux:

file -i /path/to/my/file/metadata.txt

If you're on macOS:

file -I /path/to/my/file/metadata.txt
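
To illustrate why the encoding matters, the Python sketch below encodes “ƒ” with a few common encodings. The list of encodings is only an example, not a claim about which one the file actually used.

```python
char = "ƒ"  # U+0192, LATIN SMALL LETTER F WITH HOOK
encodings = ["utf-8", "utf-16-le", "cp1252", "mac_roman"]

# Same character, different bytes on disk depending on the encoding.
encoded = {name: char.encode(name) for name in encodings}
for name, raw in encoded.items():
    print(f"{name:10} {raw.hex()}")
```

Only one of these byte sequences is valid UTF-8, which is why a file saved with any of the others can fail to load.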

Thanks!

nick-youngblut · December 8, 2017, 8:00am

Ok, I ran file -i on my metadata file (I'm using Ubuntu 16.04.3) and the encoding is plain us-ascii:

metadata.txt: text/plain; charset=us-ascii

I hope that helps!

antgonza · December 8, 2017, 4:02pm

Hello,

In the past, I have dealt with this issue by removing all the invalid characters. It's not elegant, but it allows me to move forward and then fix the metadata I care about, based on the results/visualizations I want to use.

Anyway, a few options:

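# Remove every character that is not printable, a carriage return, or a tab
sed 's/[^[:print:]\r\t]//g' < your_input_file > your_output_file
# Write out the lines that are NOT valid text in the current locale (this finds the bad lines; it does not clean them)
grep -axv '.*' your_input_file > your_output_file
# Delete every byte in the range 128-255 (GNU sed, C locale); -i is dropped because the result is redirected to a new file
LC_ALL=C sed 's/[\d128-\d255]//g' your_input_file > your_output_file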
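
The same byte-stripping idea can be sketched in Python. `strip_non_ascii` is a hypothetical helper for illustration; like the commands above it is lossy, and it will also destroy any correctly encoded non-ASCII characters.

```python
def strip_non_ascii(data: bytes) -> bytes:
    """Drop every byte outside the 7-bit ASCII range (0-127)."""
    return bytes(b for b in data if b < 128)
```

It works on raw bytes, so the file should be opened in binary mode ("rb" to read, "wb" to write the cleaned copy).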

Hope this helps

jairideout · December 8, 2017, 6:06pm

Thanks @nick-youngblut! Interesting, I would have expected a non-ASCII encoding. Did you run this command on the TSV metadata file that was originally causing problems (i.e. the file with “ƒ”)? That's specifically what I'm after here. No worries if this is too much of a pain -- I can probably put together some unit tests to handle the case you ran into.

Thanks @antgonza! Just a note for anyone following along, this workaround shouldn't be necessary after the metadata overhaul is released this month. I'll follow up here when this is supported in the 2017.12 release!

jairideout · February 16, 2018, 4:57pm

In the QIIME 2 2018.2 release, Metadata now has explicit support for UTF-8 files and Unicode characters. If a non-UTF-8 metadata file is used, a more informative error message is displayed.

There are a number of other changes to QIIME 2 Metadata in the 2018.2 release. See this forum announcement for details on what changed, as well as the updated Metadata tutorial.

thermokarst · March 12, 2018, 1:25pm

An off-topic reply has been merged into an existing topic: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Please keep replies on-topic in the future.