Implementation:Huggingface Datasets HDF5 Builder

Source	src/datasets/packaged_modules/hdf5/hdf5.py (lines 40-94)
Domain(s)	Data_Loading, Scientific_Computing
Last Updated	2026-02-14

Overview

Description

HDF5 is a packaged dataset builder (subclass of ArrowBasedBuilder) in the HuggingFace Datasets library that loads HDF5 files into Apache Arrow tables. HDF5 (Hierarchical Data Format version 5) is a widely used file format in scientific computing for storing large, complex datasets with hierarchical group structures.

The builder automatically infers dataset features from the HDF5 file structure by recursively traversing groups and datasets. It handles a variety of HDF5 data types, including:

Standard numeric arrays -- Converted to the appropriate Arrow/HuggingFace array types (Value, List, Array2D through Array5D).
Complex-valued arrays -- Decomposed into {"real": ..., "imag": ...} struct representations.
Compound (structured) dtypes -- Recursively expanded into nested struct features.
Variable-length data -- Including variable-length strings and numeric arrays via h5py.vlen_dtype.
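The recursive expansion of compound and complex dtypes can be illustrated with a small standalone sketch. It does not use the builder's own code: the dtype descriptions (plain strings and lists of field pairs) and the helper name expand_dtype are hypothetical stand-ins chosen so the example runs without h5py. The only hard fact relied on is that complex64 and complex128 hold two float32 or two float64 parts respectively.

```python
# Hypothetical, simplified dtype descriptions (not h5py or NumPy objects):
#   - a string such as "float32" or "complex64" for a scalar dtype
#   - a list of (field_name, dtype) pairs for a compound dtype

# Each complex dtype stores two floating-point parts of this width.
COMPLEX_PARTS = {"complex64": "float32", "complex128": "float64"}


def expand_dtype(dtype):
    """Return a nested dict (or a plain string) describing the struct layout."""
    if isinstance(dtype, list):
        # Compound dtype: expand every field recursively.
        return {name: expand_dtype(sub) for name, sub in dtype}
    if dtype in COMPLEX_PARTS:
        # Complex dtype: decompose into a real/imag struct.
        part = COMPLEX_PARTS[dtype]
        return {"real": part, "imag": part}
    # Any other scalar dtype is kept as-is.
    return dtype


def split_complex(value):
    """Decompose one complex number into the real/imag struct form."""
    return {"real": value.real, "imag": value.imag}


particle = [
    ("id", "int64"),
    ("position", [("x", "float32"), ("y", "float32")]),
    ("amplitude", "complex64"),
]
layout = expand_dtype(particle)
sample = split_complex(3 + 4j)
```

Here layout nests the position fields one level down and replaces amplitude with a two-field struct, which mirrors the shape of the features described in the list above.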

The builder supports configurable batch sizes for chunked reading and validates that all top-level datasets within an HDF5 file share the same row count.
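The row-count check can be sketched as follows. This is an illustration under stated assumptions rather than the builder's code: the helper name check_row_counts is hypothetical, dataset shapes are passed in as plain tuples, and the first axis of every dataset is assumed to index rows.

```python
def check_row_counts(shapes):
    """Return the row count shared by all datasets, or raise if they differ.

    `shapes` maps a dataset name to its shape tuple. The first axis is
    assumed to be the row axis, so every shape needs at least one dimension.
    """
    counts = {name: shape[0] for name, shape in shapes.items()}
    distinct = set(counts.values())
    if len(distinct) > 1:
        raise ValueError(f"Datasets disagree on row count: {counts}")
    # An empty mapping has no rows at all.
    return distinct.pop() if distinct else 0


num_rows = check_row_counts({"signal": (1000, 128, 256), "label": (1000,)})
```

A file whose datasets have shapes such as (1000, 128, 256) and (999,) would be rejected by this check, since its rows could not be aligned into a single table.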

Usage

Use the HDF5 builder when you need to load scientific or research data stored in HDF5 format into the HuggingFace Datasets ecosystem. Common scenarios include:

Loading simulation output data, sensor readings, or experimental results stored as HDF5.
Converting large scientific datasets with hierarchical structure into Arrow-backed datasets for efficient processing.
Working with complex-valued or compound-typed data that requires special handling.

Code Reference

Source Location

Repository: huggingface/datasets

File: src/datasets/packaged_modules/hdf5/hdf5.py (lines 40-94)

Signature

@dataclass
class HDF5Config(datasets.BuilderConfig):
    batch_size: Optional[int] = None
    features: Optional[datasets.Features] = None


class HDF5(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = HDF5Config

Key Methods:

_info(self) -> DatasetInfo -- Returns a DatasetInfo built from the features in the config; the features are None when they will be inferred at load time.
_split_generators(self, dl_manager) -> list[SplitGenerator] -- Downloads data files, infers features from the first HDF5 file if not provided, and creates split generators for each split.
_generate_tables(self, files) -> Iterator[tuple[Key, pa.Table]] -- Opens each HDF5 file, reads data in batches (using batch_size from config, writer batch size, or the full row count), and yields Arrow tables cast to the inferred features.
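The batching described for _generate_tables can be sketched without any file I/O. The helper names below are hypothetical, and the sketch assumes the fallbacks are tried in the order listed (config batch size, then writer batch size, then the full row count). It only computes the row ranges and the (file_idx, batch_idx) keys, not the Arrow tables themselves.

```python
def effective_batch_size(config_batch_size, writer_batch_size, num_rows):
    """Pick the first batch size that is set, falling back to all rows."""
    return config_batch_size or writer_batch_size or num_rows


def iter_batches(file_idx, num_rows, batch_size):
    """Yield ((file_idx, batch_idx), (start, end)) for each row range."""
    if num_rows == 0:
        return  # nothing to read; also avoids a zero step in range()
    for batch_idx, start in enumerate(range(0, num_rows, batch_size)):
        end = min(start + batch_size, num_rows)  # last batch may be short
        yield (file_idx, batch_idx), (start, end)


size = effective_batch_size(None, 1000, 2500)
batches = list(iter_batches(0, 2500, size))
```

With 2500 rows and a batch size of 1000, this produces three ranges, the last one covering only the remaining 500 rows.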

Import

# Not imported directly; used via load_dataset
from datasets import load_dataset

ds = load_dataset("hdf5", data_files="path/to/file.h5")

I/O Contract

Inputs

Parameter	Type	Description
`data_files`	`str` or `list[str]` or `dict[str, str]`	Path(s) to HDF5 files. Supports `.h5` and `.hdf5` extensions.
`batch_size`	`Optional[int]` (default: `None`)	Number of rows to read per batch. If unset, falls back to the writer batch size, then to the total number of rows.
`features`	`Optional[datasets.Features]` (default: `None`)	Explicit feature schema. If `None`, features are inferred by recursively inspecting the HDF5 file structure.

Supported HDF5 data types:

HDF5 Type	Arrow/HF Mapping
Numeric scalars	`Value` (int, float types)
1D arrays	`List(Value)` with fixed length
2D-5D arrays	`Array2D` through `Array5D`
Complex types (`complex64`, `complex128`)	`{"real": ..., "imag": ...}` struct
Compound dtypes	Nested struct features
Variable-length strings/arrays	`Value("string")` or `List(Value)`
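The dimensionality rows of the table above can be read as a dispatch on the per-row shape. The sketch below is a hypothetical illustration, not the builder's code: describe_feature returns a descriptive string instead of a real feature object so that it runs without the datasets library. It assumes that the first axis indexes rows and that shapes beyond five per-row dimensions are rejected.

```python
def describe_feature(dataset_shape, dtype="float32"):
    """Describe the feature a dataset of this shape would map to."""
    row_shape = tuple(dataset_shape[1:])  # drop the leading row axis (assumption)
    rank = len(row_shape)
    if rank == 0:
        # One scalar per row.
        return f'Value("{dtype}")'
    if rank == 1:
        # One fixed-length vector per row.
        return f'List(Value("{dtype}"), length={row_shape[0]})'
    if rank <= 5:
        # One 2D to 5D array per row.
        return f'Array{rank}D(shape={row_shape}, dtype="{dtype}")'
    raise ValueError(f"Unsupported per-row rank: {rank}")


labels = describe_feature((1000,), dtype="int64")
signals = describe_feature((1000, 128, 256))
```

For example, a dataset of shape (1000, 128, 256) holds 1000 rows of 128 x 256 arrays, which corresponds to the Array2D row of the table.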

Outputs

Output	Type	Description
Yielded tables	`pa.Table`	Arrow tables with features cast to the inferred or provided feature schema, keyed by `(file_idx, batch_idx)`.
Final dataset	`Dataset`	A standard HuggingFace Dataset backed by Arrow storage.

Usage Examples

Basic HDF5 loading

from datasets import load_dataset

# Load a single HDF5 file
ds = load_dataset("hdf5", data_files="experiments/results.h5")
print(ds["train"].features)
print(ds["train"][0])

Loading with explicit splits and batch size

from datasets import load_dataset

# Load multiple HDF5 files with a custom batch size
ds = load_dataset(
    "hdf5",
    data_files={
        "train": "data/train_*.hdf5",
        "test": "data/test_*.hdf5",
    },
    batch_size=1000,
)
print(ds["train"].num_rows)

Loading with explicit features

from datasets import Features, Value, Array2D, load_dataset

features = Features({
    "signal": Array2D(shape=(128, 256), dtype="float32"),
    "label": Value("int64"),
})

ds = load_dataset("hdf5", data_files="data/signals.h5", features=features)

Related Pages

Principles

HDF5 Dataset Building -- Principle for loading and converting HDF5 hierarchical data into Arrow-backed datasets.

Environments

Huggingface Datasets -- The parent library providing the dataset builder infrastructure.
Scientific Computing -- Domain context for HDF5 data handling in research environments.

Related Implementations

ArrowBasedBuilder -- The base class that provides the Arrow table writing and caching infrastructure.
ParquetBuilder -- Another Arrow-based builder for Parquet files, following a similar pattern.
