My metadata table seems to contain at least one non-ASCII character, so when I try to use it with commands such as qiime taxa barplot, I get the following error:
There was an issue with loading the file /path/to/my/file/metadata.txt as metadata:
'utf-8' codec can't decode byte 0xc4 in position 7724: invalid continuation byte
My metadata table is 3289 rows x 80 columns, so I’m having a hard time finding “position 7724”. Does that mean the 7724th value in the metadata table (read row-wise), the 7724th character, or something else? It can’t be a row or column index, since the table only has 3289 rows and 80 columns.
It would be helpful if this error reported the row and column instead of the “position”.
Hi @nick-youngblut! This error message is being raised by Python when attempting to read the TSV file. I agree it’s not the most user-friendly error message. The error message states the value of the byte that can’t be decoded (0xc4) – I think this would be the 7725th byte in your file (the “position” looks like a 0-based byte offset into the raw file, not a 1-based row or column index).
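In the meantime, here’s a minimal sketch (the helper name is made up, not part of QIIME 2) that turns that 0-based byte offset into a line and column by scanning the raw bytes, so the decode error isn’t triggered again:

```python
# Sketch: convert the 0-based byte "position" from a UnicodeDecodeError
# into a 1-based (line, column) location. Reads the file in binary so the
# offending byte can't raise the decode error a second time.
def locate_byte(path, position):
    with open(path, "rb") as fh:
        data = fh.read()
    line = data.count(b"\n", 0, position) + 1          # newlines before it, plus one
    last_newline = data.rfind(b"\n", 0, position)      # -1 if it's on the first line
    column = position - last_newline                   # bytes past the last newline
    return line, column
```

Note the column is a byte count, not a character count, so it can be off for lines that contain earlier multi-byte characters – but it gets you to the right neighborhood.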
You can check what encoding your file has by using one of the Unix tools described here. That may shed some light on the issue – it looks like you have a Unicode file encoded with something other than utf-8. Re-encoding to utf-8 may solve the issue.
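If you’d rather do the re-encoding from Python, here’s a rough sketch of the same idea – the candidate encoding list, the helper name, and the filenames are my own assumptions, not part of QIIME 2:

```python
# Sketch: try a few candidate encodings and write a UTF-8 copy using the
# first one that decodes cleanly. latin-1 goes last because it accepts any
# byte sequence, so it "succeeds" even when it's the wrong guess.
def reencode_to_utf8(src, dst,
                     candidates=("utf-8", "mac_roman", "cp1252", "latin-1")):
    with open(src, "rb") as fh:
        raw = fh.read()
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        with open(dst, "w", encoding="utf-8") as out:
            out.write(text)
        return encoding  # report which guess worked
    raise ValueError("none of the candidate encodings decoded the file")
```

Because latin-1 never fails, a “clean” decode doesn’t prove the guess was right – eyeball the non-ASCII characters in the output to confirm.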
I’m working on overhauling Metadata in qiime2 for this month’s release (2017.12), and part of the work will include better error messages for these types of situations. Would you mind sharing your metadata file with me so that I can make sure this situation is handled better? Feel free to send me a DM if you don’t want to share the file publicly. Thanks!
Thanks for the explanation! After a bit of trial and error (removing columns from the metadata table and seeing if the table would load), I was able to find the value that caused the error: “start of box 12.17ƒ had to use 2nd aliquot b/c 1st was too small”. Removing the “ƒ” fixed the issue. If you add this character to any metadata table, you should be able to reproduce my error.
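Interestingly, “ƒ” is the single byte 0xC4 in the legacy Mac Roman encoding – the same byte the error reported – which suggests the file may have been Mac Roman-encoded. A sketch that reproduces the same error (the sample text matches my cell; the temp file is just for illustration):

```python
import tempfile

# Sketch: write the offending cell with the mac_roman codec, where "ƒ" is
# the single byte 0xC4, then read the file back as UTF-8. The byte after
# 0xC4 is a space, which is not a valid UTF-8 continuation byte.
path = tempfile.NamedTemporaryFile(delete=False).name
with open(path, "w", encoding="mac_roman") as fh:
    fh.write("start of box 12.17ƒ had to use 2nd aliquot\n")

try:
    with open(path, encoding="utf-8") as fh:
        fh.read()
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0xc4 in position 18: invalid continuation byte
```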
Thanks for the details @nick-youngblut, I will definitely test that out! I’m specifically curious about the encoding of the metadata file that you exported – the file contains a non-ASCII character (“ƒ”), but that character can be encoded in the file a number of ways. Would you be able to run the following command on your metadata TSV file that contained the non-ASCII character, and send me the output? That would be helpful!
In the past, I have dealt with this issue by removing all the invalid characters. Not elegant, but it allows me to move forward and then fix the metadata I care about, based on the results/visualizations I want to use.
Anyway, a few options:

Strip non-printable characters (keeps tabs and carriage returns):
sed 's/[^[:print:]\r\t]//g' < your_input_file > your_output_file

List the lines containing invalid bytes (in a UTF-8 locale), to help locate them:
grep -axv '.*' your_input_file > your_output_file

Delete bytes 128-255 in place (GNU sed only; -i writes back to the input file, so no output redirect is needed):
sed -i 's/[\d128-\d255]//g' your_input_file
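If the sed/grep flavors are a hassle (the two sed forms rely on GNU extensions), here’s a byte-level Python sketch of the same cleanup – the helper name and filenames are placeholders:

```python
# Sketch: keep only the 7-bit ASCII bytes, so any multi-byte or
# legacy-encoded character is dropped from the copy.
def strip_non_ascii(src, dst):
    with open(src, "rb") as fh:
        raw = fh.read()
    with open(dst, "wb") as out:
        out.write(bytes(b for b in raw if b < 128))
```

Like the sed options, this throws the offending characters away rather than re-encoding them, so only use it when the non-ASCII content is expendable.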
Thanks @nick-youngblut! Interesting, I would have expected a non-ASCII encoding. Did you run this command on the TSV metadata file that was originally causing problems (i.e. the file with “ƒ”)? That’s specifically what I’m after here. No worries if this is too much of a pain – I can probably put together some unit tests to handle the case you ran into.
Thanks @antgonza! Just a note for anyone following along, this workaround shouldn’t be necessary after the metadata overhaul is released this month. I’ll follow up here when this is supported in the 2017.12 release!