Implementation:Huggingface Datasets HDF5 Builder
| Source | src/datasets/packaged_modules/hdf5/hdf5.py (lines 40-94) |
|---|---|
| Domain(s) | Data_Loading, Scientific_Computing |
| Last Updated | 2026-02-14 |
Overview
Description
HDF5 is a packaged dataset builder (subclass of ArrowBasedBuilder) in the HuggingFace Datasets library that loads HDF5 files into Apache Arrow tables. HDF5 (Hierarchical Data Format version 5) is a widely used file format in scientific computing for storing large, complex datasets with hierarchical group structures.
The builder automatically infers dataset features from the HDF5 file structure by recursively traversing groups and datasets. It handles a variety of HDF5 data types including:
- Standard numeric arrays -- Converted to the appropriate Arrow/HuggingFace array types (Value, List, Array2D through Array5D).
- Complex-valued arrays -- Decomposed into {"real": ..., "imag": ...} struct representations.
- Compound (structured) dtypes -- Recursively expanded into nested struct features.
- Variable-length data -- Including variable-length strings and numeric arrays via h5py.vlen_dtype.
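The complex-valued case can be illustrated with a minimal sketch. This is not the builder's own code: `split_complex` is a hypothetical helper, and plain Python complex numbers stand in for a complex64 or complex128 HDF5 column.

```python
def split_complex(values):
    """Split a sequence of complex numbers into a {"real", "imag"} struct of lists."""
    # Hypothetical helper: mirrors the struct layout described above.
    return {
        "real": [v.real for v in values],
        "imag": [v.imag for v in values],
    }


column = [1 + 2j, 3 - 4j, 0.5j]
struct = split_complex(column)
print(struct)
```

Storing the two components as separate float fields keeps the column representable with ordinary floating-point types.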
The builder supports configurable batch sizes for chunked reading and validates that all top-level datasets within an HDF5 file share the same row count.
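The row-count check might look roughly like the sketch below. It assumes each top-level dataset is described by its shape tuple with rows on the first axis; `check_row_counts` and the error message are illustrative, not taken from the library.

```python
def check_row_counts(shapes):
    """Return the shared row count, or raise if top-level datasets disagree.

    `shapes` maps a dataset name to its shape tuple; axis 0 is the row axis.
    Assumes every dataset has at least one axis.
    """
    counts = {name: shape[0] for name, shape in shapes.items()}
    distinct = set(counts.values())
    if len(distinct) > 1:
        raise ValueError(f"Mismatched row counts across datasets: {counts}")
    # An empty file has no rows; otherwise return the single shared count.
    return distinct.pop() if distinct else 0


print(check_row_counts({"signal": (100, 128, 256), "label": (100,)}))
```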
Usage
Use the HDF5 builder when you need to load scientific or research data stored in HDF5 format into the HuggingFace Datasets ecosystem. Common scenarios include:
- Loading simulation output data, sensor readings, or experimental results stored as HDF5.
- Converting large scientific datasets with hierarchical structure into Arrow-backed datasets for efficient processing.
- Working with complex-valued or compound-typed data that requires special handling.
Code Reference
Source Location
Repository: huggingface/datasets
File: src/datasets/packaged_modules/hdf5/hdf5.py (lines 40-94)
Signature
@dataclass
class HDF5Config(datasets.BuilderConfig):
    batch_size: Optional[int] = None
    features: Optional[datasets.Features] = None

class HDF5(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = HDF5Config
Key Methods:
- _info(self) -> DatasetInfo -- Returns dataset info using features from the config (or None if features will be inferred at load time).
- _split_generators(self, dl_manager) -> list[SplitGenerator] -- Downloads data files, infers features from the first HDF5 file if not provided, and creates split generators for each split.
- _generate_tables(self, files) -> Iterator[tuple[Key, pa.Table]] -- Opens each HDF5 file, reads data in batches (using batch_size from config, writer batch size, or the full row count), and yields Arrow tables cast to the inferred features.
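The batching behaviour described for _generate_tables can be sketched as a small generator. The fallback order follows the description above; `iter_batches` and its parameter names are hypothetical, and the real method reads HDF5 slices rather than yielding index ranges.

```python
def iter_batches(num_rows, batch_size=None, writer_batch_size=None):
    """Yield (batch_idx, start, end) row ranges covering `num_rows` rows.

    The effective batch size falls back from `batch_size` to
    `writer_batch_size` to the full row count, in that order.
    """
    if num_rows == 0:
        return  # nothing to read; also avoids a zero step below
    effective = batch_size or writer_batch_size or num_rows
    for batch_idx, start in enumerate(range(0, num_rows, effective)):
        yield batch_idx, start, min(start + effective, num_rows)


for batch in iter_batches(2500, batch_size=1000):
    print(batch)
```

Each range would correspond to one yielded table, with the final batch shortened to the remaining rows.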
Import
# Not imported directly; used via load_dataset
from datasets import load_dataset
ds = load_dataset("hdf5", data_files="path/to/file.h5")
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| data_files | str or list[str] or dict[str, str] | Path(s) to HDF5 files. Supports .h5 and .hdf5 extensions. |
| batch_size | Optional[int] (default: None) | Number of rows to read per batch. If unset, falls back to the writer batch size, then to the total number of rows. |
| features | Optional[datasets.Features] (default: None) | Explicit feature schema. If None, features are inferred by recursively inspecting the HDF5 file structure. |
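The recursive inspection can be pictured with a stand-in structure. In this sketch a nested dict plays the role of an HDF5 group and a dtype name plays the role of a dataset. Mapping groups to nested structs is an assumption made for illustration, and `infer_schema` is a hypothetical name.

```python
def infer_schema(node):
    """Recursively turn a nested layout into a nested schema description.

    A dict stands in for an HDF5 group; any other value is treated as a
    leaf dataset described by its dtype name.
    """
    schema = {}
    for name, child in node.items():
        if isinstance(child, dict):
            schema[name] = infer_schema(child)  # group: recurse into children
        else:
            schema[name] = f"Value({child})"  # dataset: leaf feature
    return schema


layout = {
    "label": "int64",
    "sensors": {"temperature": "float32", "meta": {"station": "string"}},
}
schema = infer_schema(layout)
print(schema)
```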
Supported HDF5 data types:
| HDF5 Type | Arrow/HF Mapping |
|---|---|
| Numeric scalars | Value (int, float types) |
| 1D arrays | List(Value) with fixed length |
| 2D-5D arrays | Array2D through Array5D |
| Complex types (complex64, complex128) | {"real": ..., "imag": ...} struct |
| Compound dtypes | Nested struct features |
| Variable-length strings/arrays | Value("string") or List(Value) |
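The numeric rows of this table amount to a dispatch on the per-row number of dimensions. The sketch below returns descriptive strings rather than real feature objects so that it runs without the library installed. `describe_feature` is hypothetical, and raising beyond 5D is an assumption, since the table lists nothing above Array5D.

```python
def describe_feature(dtype, shape):
    """Describe the feature for one row of a numeric dataset.

    `shape` is the per-row shape (the leading row axis is excluded),
    so an empty tuple means one scalar per row.
    """
    ndim = len(shape)
    if ndim == 0:
        return f"Value({dtype})"
    if ndim == 1:
        return f"List(Value({dtype}), length={shape[0]})"
    if 2 <= ndim <= 5:
        return f"Array{ndim}D(shape={shape}, dtype={dtype})"
    # Assumption: higher-rank data is unsupported in this sketch.
    raise ValueError(f"Unsupported number of dimensions: {ndim}")


for row_shape in [(), (16,), (128, 256)]:
    print(describe_feature("float32", row_shape))
```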
Outputs
| Output | Type | Description |
|---|---|---|
| Yielded tables | pa.Table | Arrow tables with features cast to the inferred or provided feature schema, keyed by (file_idx, batch_idx). |
| Final dataset | Dataset | A standard HuggingFace Dataset backed by Arrow storage. |
Usage Examples
Basic HDF5 loading
from datasets import load_dataset
# Load a single HDF5 file
ds = load_dataset("hdf5", data_files="experiments/results.h5")
print(ds["train"].features)
print(ds["train"][0])
Loading with explicit splits and batch size
from datasets import load_dataset
# Load multiple HDF5 files with a custom batch size
ds = load_dataset(
    "hdf5",
    data_files={
        "train": "data/train_*.hdf5",
        "test": "data/test_*.hdf5",
    },
    batch_size=1000,
)
print(ds["train"].num_rows)
Loading with explicit features
from datasets import Features, Value, Array2D, load_dataset
features = Features({
    "signal": Array2D(shape=(128, 256), dtype="float32"),
    "label": Value("int64"),
})
ds = load_dataset("hdf5", data_files="data/signals.h5", features=features)
Related Pages
Principles
- HDF5 Dataset Building -- Principle for loading and converting HDF5 hierarchical data into Arrow-backed datasets.
Environments
- Huggingface Datasets -- The parent library providing the dataset builder infrastructure.
- Scientific Computing -- Domain context for HDF5 data handling in research environments.
Related Implementations
- ArrowBasedBuilder -- The base class that provides the Arrow table writing and caching infrastructure.
- ParquetBuilder -- Another Arrow-based builder for Parquet files, following a similar pattern.