Feature-table gives: 'ascii' codec can't decode byte 0xc3 in position 4 Error

I think this is actually caused by a combination of changes in h5py and BIOM:

In the h5py 3.0 series, the following changed:

  • UTF-8 Variable -> numpy 'O' of bytes (tagged with UTF-8 encoding via dtype)

And this change in BIOM would have meant that the byte sequence is no longer decoded:

Combined with the behavior of np.asarray, which decodes bytes as ASCII when asked for a unicode dtype (see the traceback below), this means that non-ASCII UTF-8 sequences fail.
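The underlying failure can be reproduced without h5py or BIOM. The following is a minimal sketch in plain Python; the ID "Malmö-1" is a hypothetical example chosen so that the first UTF-8 byte of the non-ASCII character lands at position 4, and it is not taken from the actual table.

```python
# Hypothetical sample ID (not from the real feature table).
# "ö" is encoded in UTF-8 as the two bytes 0xc3 0xb6.
raw_id = "Malmö-1".encode("utf-8")

# Decoding as ASCII fails on the first non-ASCII byte.
try:
    raw_id.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as err:
    ascii_error = str(err)

# Decoding with the encoding the bytes were written in succeeds.
utf8_text = raw_id.decode("utf-8")

print(ascii_error)
print(utf8_text)
```

The message from the ASCII attempt has the same shape as the one in the traceback, which is consistent with the bytes being decoded as ASCII somewhere along the way rather than as UTF-8.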


In [1]: from h5py import File

In [2]: import numpy as np

In [3]: fh = File('Downloads/feature-table.biom')

In [4]: np.asarray(fh.get('sample/ids')[:], dtype='U20')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[4], line 1
----> 1 np.asarray(fh.get('sample/ids')[:], dtype='U20')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

Using .asstr(), as described in later versions of the docs (which looked to be slightly stale for 3.11.0), works:

In [5]: np.asarray(fh.get('sample/ids').asstr()[:], dtype='U20')
Out[5]:
array(['li8', 'li12', 'li4', ..., 'yanez-montalvo19', 'yanez-montalvo9',
       'yanez-montalvo1'], dtype='<U20')