ebolyen
(Evan Bolyen)
December 16, 2025, 11:09am
I think it is actually a combination of h5py changes and a BIOM change:
In the h5py 3.0 series, the following changed:
- UTF-8 Variable -> numpy 'O' of bytes (tagged with UTF-8 encoding via dtype)
(h5py GitHub issue: opened 21 Sep 2019, closed 3 Aug 2020; labels: enhancement, api changes, string/unicode usage)
@tacaswell, @aragilar and myself sat down yesterday to discuss how various string-like types should be mapped between Python and HDF5. We have decided we do want to make some changes for h5py 3.0. This will inevitably break some code using h5py, but we think it will create a more consistent API, which is also a better fit with Python 3's separation of bytes & str.
We have divided the rules into three kinds of conversion:
```python
# 1. Writing to an existing dataset
f['data'][:10] = obj
# Also applies to creating datasets with data and a specified dtype
f.create_dataset('data', dtype=x, data=obj)
# 2. Creating a new dataset without a specified dtype
f['data'] = obj
f.create_dataset('data', data=obj)
# 3. Reading data
obj = f['data'][:]
```
We aim to ensure that `f['data2'] = f['data1'][:]` will always create the new dataset with the same data type as the copied one. We believe the rules below preserve this - please comment if you notice a case we've missed.
## 1. Writing to existing datasets
Everything rejects numpy U dtypes (UTF-32 fixed width), as in h5py 2.x, because there is no equivalent HDF5 type.
### fixed-width ascii
- accept bytes
- accept str and encode to ascii, error on invalid ascii
- If you need to write non-ascii data, encode it first and pass bytes
- ~~check that it will fit in width, raise if not~~ **Edit:** No length checks, matching numpy's behaviour with string arrays. I've put a warning in the docs (PR #1613) about this.
### fixed width utf-8
- accept bytes, ~~just check length~~
- accept str, encode to utf-8
### variable width ascii
- accept any bytes (except NULL)
- accept str, encode to ascii, error on non-ascii
### variable length utf-8
- accept any bytes (except NULL)
- accept str, encode to utf-8
### Opaque
- accept bytes, np.void()
- reject str
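A minimal sketch of how the write rules above behave in practice, assuming h5py ≥ 3.0; the file name and dataset names are illustrative and not anything from BIOM:

```python
import h5py

# Illustrative throwaway file; not part of the quoted issue.
with h5py.File("demo_write.h5", "w") as f:
    # Fixed-width ASCII ('S' dtype): bytes pass through, str is encoded to
    # ASCII on write, and non-ASCII str is rejected.
    f.create_dataset("ascii_fixed", shape=(2,), dtype="S10")
    f["ascii_fixed"][0] = b"bytes ok"
    f["ascii_fixed"][1] = "plain str"      # encoded to ASCII
    try:
        f["ascii_fixed"][1] = "café"       # invalid ASCII -> error
    except Exception as exc:               # exact exception type may vary
        print("rejected:", exc)

    # Variable-length UTF-8: str is encoded to UTF-8, bytes are accepted as-is.
    f.create_dataset("utf8_vlen", shape=(2,),
                     dtype=h5py.string_dtype(encoding="utf-8"))
    f["utf8_vlen"][0] = "café"
    f["utf8_vlen"][1] = b"raw bytes"
```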
## 2. Creating datasets
- numpy object/string array with 'tagged' dtype -> follow the tag
  - tagged means a dtype with string metadata created by h5py to indicate string charset & width
- numpy string array not tagged -> fixed width ascii
- numpy object array of bytes not tagged -> ~~void opaque~~ variable length ascii **(see discussion in comments)**
- List of bytes -> ~~opaque~~ variable length ascii
- numpy object array of str not tagged -> variable length utf-8
- list of str -> variable length utf-8
- numpy void array -> opaque
- numpy array of U type -> raise
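A minimal sketch of the dataset-creation rules above, again assuming h5py ≥ 3.0 (names are illustrative; `h5py.check_string_dtype` is h5py's helper for inspecting the resulting string dtype):

```python
import h5py
import numpy as np

# Illustrative throwaway file; not part of the quoted issue.
with h5py.File("demo_create.h5", "w") as f:
    f["from_str_list"] = ["a", "café"]             # -> variable-length UTF-8
    f["from_bytes_list"] = [b"a", b"b"]            # -> variable-length ASCII
    f["from_S_array"] = np.array([b"ab", b"cd"])   # 'S' dtype -> fixed-width ASCII
    try:
        f["from_U_array"] = np.array(["ab"], dtype="U2")  # 'U' dtype -> raise
    except Exception as exc:                       # exact exception type may vary
        print("rejected:", exc)

    for name in ("from_str_list", "from_bytes_list", "from_S_array"):
        print(name, h5py.check_string_dtype(f[name].dtype))
```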
## 3. Reading data
- ASCII Fixed -> numpy 'S' (tagged with ASCII encoding)
- UTF-8 Fixed -> numpy 'S' (tagged with UTF-8 encoding)
- ASCII Variable -> numpy 'O' of bytes (tagged with ASCII encoding via dtype)
- UTF-8 Variable -> numpy 'O' of bytes (tagged with UTF-8 encoding via dtype)
- Opaque with no tag -> numpy 'V'
- Opaque with h5py dtype tag -> follow tag
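The reading rules are what bite downstream code: in h5py 3.x a variable-length string dataset comes back as an object array of bytes, and `.asstr()` is the opt-in way to get str back. A minimal sketch, assuming h5py ≥ 3.0, with illustrative names:

```python
import h5py

# Illustrative throwaway file; not part of the quoted issue.
with h5py.File("demo_read.h5", "w") as f:
    f["ids"] = ["sample-1", "café"]   # stored as variable-length UTF-8
    raw = f["ids"][:]                 # object array of bytes in h5py 3.x
    print(raw)                        # [b'sample-1' b'caf\xc3\xa9']
    decoded = f["ids"].asstr()[:]     # object array of str
    print(decoded)                    # ['sample-1' 'café']
```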
## Attributes
Attributes follow the same rules as for datasets, with a couple of exceptions:
- An attribute created from a single str/bytes object will be a scalar
vlen string with UTF-8 charset (str) or ASCII (bytes).
- **Edit:** Still true, but no longer a special case.
- An attribute with a scalar vlen string type will be returned as a single
str ~~/bytes~~ object depending on its charset, to preserve roundtripping.
- **Edit:** all vlen string attributes are now read as str (decoded utf-8 with surrogateescape). We no longer return different types anywhere based on ASCII/UTF-8.
- Section 1 does not apply, as h5py does not expose a high-level API to modify
an attribute.
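Attributes are the one place where string values come back as str without any opt-in. A minimal sketch, assuming h5py ≥ 3.0:

```python
import h5py

# Illustrative throwaway file; not part of the quoted issue.
with h5py.File("demo_attrs.h5", "w") as f:
    f.attrs["title"] = "café"    # scalar vlen string, UTF-8 charset
    f.attrs["raw"] = b"bytes"    # scalar vlen string, ASCII charset
    print(type(f.attrs["title"]), f.attrs["title"])  # <class 'str'> café
    print(type(f.attrs["raw"]), f.attrs["raw"])      # <class 'str'> bytes (decoded)
```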
And this change in BIOM would have meant that the byte sequences are no longer decoded:
(BIOM pull request wasade:conserve_types → master, opened 21 Mar 2022)
Improve handling of IDs.
* set a fixed width on load from HDF5
* set a fixed width on update of IDs
* avoid forcing `dtype=object` on construction
These are done for the benefit of lower level optimizations where specificity of the dtype is high value.
This, combined with the behavior of np.asarray, means that non-ASCII UTF-8 byte sequences fail to decode:
```
In [1]: from h5py import File

In [2]: import numpy as np

In [3]: fh = File('Downloads/feature-table.biom')

In [4]: np.asarray(fh.get('sample/ids')[:], dtype='U20')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[4], line 1
----> 1 np.asarray(fh.get('sample/ids')[:], dtype='U20')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
```
Using `.asstr()` as described in later versions of the docs (which looked to be slightly stale for 3.11.0) works:
```
In [5]: np.asarray(fh.get('sample/ids').asstr()[:], dtype='U20')
Out[5]:
array(['li8', 'li12', 'li4', ..., 'yanez-montalvo19', 'yanez-montalvo9',
       'yanez-montalvo1'], dtype='<U20')
```
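The same failure and fix can be reproduced without the BIOM file. This is a minimal sketch (not BIOM or QIIME 2 code): numpy converts bytes objects to a fixed-width 'U' dtype by decoding them as ASCII, so non-ASCII UTF-8 bytes raise, while decoding explicitly (or via `.asstr()`) works:

```python
import numpy as np

# An object array of UTF-8 encoded bytes, as h5py 3.x returns for
# variable-length string datasets.
ids = np.array([b"li8", b"caf\xc3\xa9"], dtype=object)

try:
    np.asarray(ids, dtype="U20")      # implicit ASCII decode -> fails
except UnicodeDecodeError as exc:
    print("fails:", exc)

decoded = np.array([x.decode("utf-8") for x in ids], dtype="U20")
print(decoded)                        # ['li8' 'café']
```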