ebolyen
(Evan Bolyen)
December 16, 2025, 11:09am
I think it is actually a combination of h5py changes and a BIOM change:
In the h5py 3.0 series, the following changed:
- UTF-8 Variable -> numpy 'O' of bytes (tagged with UTF-8 encoding via dtype)
(h5py GitHub issue: opened 21 Sep 2019, closed 3 Aug 2020; labels: enhancement, api changes, string/unicode usage)
@tacaswell, @aragilar and myself sat down yesterday to discuss how various string-like types should be mapped between Python and HDF5. We have decided we do want to make some changes for h5py 3.0. This will inevitably break some code using h5py, but we think it will create a more consistent API, which is also a better fit with Python 3's separation of bytes & str.
We have divided the rules into three kinds of conversion:
```python
# 1. Writing to an existing dataset
f['data'][:10] = obj
# Also applies to creating datasets with data and a specified dtype
f.create_dataset('data', dtype=x, data=obj)
# 2. Creating a new dataset without a specified dtype
f['data'] = obj
f.create_dataset('data', data=obj)
# 3. Reading data
obj = f['data'][:]
```
We aim to ensure that `f['data2'] = f['data1'][:]` will always create the new dataset with the same data type as the copied one. We believe the rules below preserve this - please comment if you notice a case we've missed.
## 1. Writing to existing datasets
Everything rejects numpy U dtypes (UTF-32 fixed width), as in h5py 2.x, because there is no equivalent HDF5 type.
### fixed-width ascii
- accept bytes
- accept str and encode to ascii, error on invalid ascii
- If you need to write non-ascii data, encode it first and pass bytes
- ~~check that it will fit in width, raise if not~~ **Edit:** No length checks, matching numpy's behaviour with string arrays. I've put a warning in the docs (PR #1613) about this.
### fixed width utf-8
- accept bytes, ~~just check length~~
- accept str, encode to utf-8
### variable width ascii
- accept any bytes (except NULL)
- accept str, encode to ascii, error on non-ascii
### variable length utf-8
- accept any bytes (except NULL)
- accept str, encode to utf-8
### Opaque
- accept bytes, np.void()
- reject str
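A minimal sketch of how the write rules above behave in practice, assuming h5py ≥ 3.0; the file name and dataset names are illustrative and not anything from BIOM:

```python
import h5py

# Illustrative throwaway file; not part of the quoted issue.
with h5py.File("demo_write.h5", "w") as f:
    # Fixed-width ASCII ('S' dtype): bytes pass through, str is encoded to
    # ASCII on write, and non-ASCII str is rejected.
    f.create_dataset("ascii_fixed", shape=(2,), dtype="S10")
    f["ascii_fixed"][0] = b"bytes ok"
    f["ascii_fixed"][1] = "plain str"      # encoded to ASCII
    try:
        f["ascii_fixed"][1] = "café"       # invalid ASCII -> error
    except Exception as exc:               # exact exception type may vary
        print("rejected:", exc)

    # Variable-length UTF-8: str is encoded to UTF-8, bytes are accepted as-is.
    f.create_dataset("utf8_vlen", shape=(2,),
                     dtype=h5py.string_dtype(encoding="utf-8"))
    f["utf8_vlen"][0] = "café"
    f["utf8_vlen"][1] = b"raw bytes"
```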
## 2. Creating datasets
- numpy object/string array with 'tagged' dtype -> follow the tag
  - tagged means a dtype with string metadata created by h5py to indicate string charset & width
- numpy string array not tagged -> fixed width ascii
- numpy object array of bytes not tagged -> ~~void opaque~~ variable length ascii **(see discussion in comments)**
- List of bytes -> ~~opaque~~ variable length ascii
- numpy object array of str not tagged -> variable length utf-8
- list of str -> variable length utf-8
- numpy void array -> opaque
- numpy array of U type -> raise
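A minimal sketch of the dataset-creation rules above, again assuming h5py ≥ 3.0 (names are illustrative; `h5py.check_string_dtype` is h5py's helper for inspecting the resulting string dtype):

```python
import h5py
import numpy as np

# Illustrative throwaway file; not part of the quoted issue.
with h5py.File("demo_create.h5", "w") as f:
    f["from_str_list"] = ["a", "café"]             # -> variable-length UTF-8
    f["from_bytes_list"] = [b"a", b"b"]            # -> variable-length ASCII
    f["from_S_array"] = np.array([b"ab", b"cd"])   # 'S' dtype -> fixed-width ASCII
    try:
        f["from_U_array"] = np.array(["ab"], dtype="U2")  # 'U' dtype -> raise
    except Exception as exc:                       # exact exception type may vary
        print("rejected:", exc)

    for name in ("from_str_list", "from_bytes_list", "from_S_array"):
        print(name, h5py.check_string_dtype(f[name].dtype))
```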
## 3. Reading data
- ASCII Fixed -> numpy 'S' (tagged with ASCII encoding)
- UTF-8 Fixed -> numpy 'S' (tagged with UTF-8 encoding)
- ASCII Variable -> numpy 'O' of bytes (tagged with ASCII encoding via dtype)
- UTF-8 Variable -> numpy 'O' of bytes (tagged with UTF-8 encoding via dtype)
- Opaque with no tag -> numpy 'V'
- Opaque with h5py dtype tag -> follow tag
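The reading rules are what bite downstream code: in h5py 3.x a variable-length string dataset comes back as an object array of bytes, and `.asstr()` is the opt-in way to get str back. A minimal sketch, assuming h5py ≥ 3.0, with illustrative names:

```python
import h5py

# Illustrative throwaway file; not part of the quoted issue.
with h5py.File("demo_read.h5", "w") as f:
    f["ids"] = ["sample-1", "café"]   # stored as variable-length UTF-8
    raw = f["ids"][:]                 # object array of bytes in h5py 3.x
    print(raw)                        # [b'sample-1' b'caf\xc3\xa9']
    decoded = f["ids"].asstr()[:]     # object array of str
    print(decoded)                    # ['sample-1' 'café']
```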
## Attributes
Attributes follow the same rules as for datasets, with a couple of exceptions:
- An attribute created from a single str/bytes object will be a scalar
vlen string with UTF-8 charset (str) or ASCII (bytes).
- **Edit:** Still true, but no longer a special case.
- An attribute with a scalar vlen string type will be returned as a single
str ~~/bytes~~ object depending on its charset, to preserve roundtripping.
- **Edit:** all vlen string attributes are now read as str (decoded utf-8 with surrogateescape). We no longer return different types anywhere based on ASCII/UTF-8.
- Section 1 does not apply, as h5py does not expose a high-level API to modify
an attribute.
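Attributes are the one place where string values come back as str without any opt-in. A minimal sketch, assuming h5py ≥ 3.0:

```python
import h5py

# Illustrative throwaway file; not part of the quoted issue.
with h5py.File("demo_attrs.h5", "w") as f:
    f.attrs["title"] = "café"    # scalar vlen string, UTF-8 charset
    f.attrs["raw"] = b"bytes"    # scalar vlen string, ASCII charset
    print(type(f.attrs["title"]), f.attrs["title"])  # <class 'str'> café
    print(type(f.attrs["raw"]), f.attrs["raw"])      # <class 'str'> bytes (decoded)
```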
And this change in BIOM would have meant that the byte sequences are no longer decoded:
(BIOM pull request wasade:conserve_types → master, opened 21 Mar 2022)
Improve handling of IDs.
* set a fixed width on load from HDF5
* set a fixed width on update of IDs
* avoid forcing `dtype=object` on construction
These are done for the benefit of lower level optimizations where specificity of the dtype is high value.
This, combined with the behavior of np.asarray, means that non-ASCII UTF-8 byte sequences fail to decode:
```
In [1]: from h5py import File

In [2]: import numpy as np

In [3]: fh = File('Downloads/feature-table.biom')

In [4]: np.asarray(fh.get('sample/ids')[:], dtype='U20')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[4], line 1
----> 1 np.asarray(fh.get('sample/ids')[:], dtype='U20')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
```
Using `.asstr()` as described in later versions of the docs (which looked to be slightly stale for 3.11.0) works:
```
In [5]: np.asarray(fh.get('sample/ids').asstr()[:], dtype='U20')
Out[5]:
array(['li8', 'li12', 'li4', ..., 'yanez-montalvo19', 'yanez-montalvo9',
       'yanez-montalvo1'], dtype='<U20')
```
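The same failure and fix can be reproduced without the BIOM file. This is a minimal sketch (not BIOM or QIIME 2 code): numpy converts bytes objects to a fixed-width 'U' dtype by decoding them as ASCII, so non-ASCII UTF-8 bytes raise, while decoding explicitly (or via `.asstr()`) works:

```python
import numpy as np

# An object array of UTF-8 encoded bytes, as h5py 3.x returns for
# variable-length string datasets.
ids = np.array([b"li8", b"caf\xc3\xa9"], dtype=object)

try:
    np.asarray(ids, dtype="U20")      # implicit ASCII decode -> fails
except UnicodeDecodeError as exc:
    print("fails:", exc)

decoded = np.array([x.decode("utf-8") for x in ids], dtype="U20")
print(decoded)                        # ['li8' 'café']
```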