Implementation:Huggingface Datasets HDF5 Builder

From Leeroopedia
Source: src/datasets/packaged_modules/hdf5/hdf5.py (lines 40-94)
Domain(s): Data_Loading, Scientific_Computing
Last Updated: 2026-02-14

Overview

Description

The HDF5 class is a packaged dataset builder (a subclass of ArrowBasedBuilder) in the HuggingFace Datasets library that loads HDF5 files into Apache Arrow tables. The format it reads, HDF5 (Hierarchical Data Format version 5), is widely used in scientific computing for storing large, complex datasets organized into hierarchical groups.

The builder automatically infers dataset features from the HDF5 file structure by recursively traversing groups and datasets. It handles a variety of HDF5 data types including:

  • Standard numeric arrays -- Converted to the appropriate Arrow/HuggingFace array types (Value, List, Array2D through Array5D).
  • Complex-valued arrays -- Decomposed into {"real": ..., "imag": ...} struct representations.
  • Compound (structured) dtypes -- Recursively expanded into nested struct features.
  • Variable-length data -- Including variable-length strings and numeric arrays via h5py.vlen_dtype.
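To make the complex-valued case above concrete, here is a minimal pure-Python sketch that splits complex numbers into real and imaginary parts. The helper name split_complex and the per-value dict layout are assumptions made for illustration; the builder itself works on HDF5 datasets and Arrow arrays rather than Python lists.

```python
def split_complex(value):
    """Represent one complex number as a {"real", "imag"} mapping.

    Hypothetical helper: it only mirrors the struct layout described
    above and is not taken from the library's source.
    """
    return {"real": value.real, "imag": value.imag}


# A small column of complex samples, decomposed value by value.
column = [1 + 2j, 3 - 4j]
decomposed = [split_complex(v) for v in column]
```

Each complex entry becomes two floating-point fields, which is what allows the data to be stored in a format without a native complex type.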

The builder supports configurable batch sizes for chunked reading and validates that all top-level datasets within an HDF5 file share the same row count.
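The row-count check described above can be sketched in a few lines. This is an illustrative stand-in, not the builder's code: the function name check_row_counts, the input mapping of dataset names to row counts, and the error message are all assumptions.

```python
def check_row_counts(lengths):
    """Return the shared row count, or raise if datasets disagree.

    `lengths` maps a top-level dataset name to its number of rows
    (the size of its first dimension). Hypothetical helper.
    """
    counts = set(lengths.values())
    if len(counts) > 1:
        raise ValueError(f"Datasets have mismatched row counts: {lengths}")
    # An empty file (no datasets) is treated here as having zero rows.
    return counts.pop() if counts else 0


num_rows = check_row_counts({"signal": 100, "label": 100})
```

Requiring a single shared row count is what lets every dataset in the file become a column of the same table.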

Usage

Use the HDF5 builder when you need to load scientific or research data stored in HDF5 format into the HuggingFace Datasets ecosystem. Common scenarios include:

  • Loading simulation output data, sensor readings, or experimental results stored as HDF5.
  • Converting large scientific datasets with hierarchical structure into Arrow-backed datasets for efficient processing.
  • Working with complex-valued or compound-typed data that requires special handling.

Code Reference

Source Location

Repository: huggingface/datasets

File: src/datasets/packaged_modules/hdf5/hdf5.py (lines 40-94)

Signature

@dataclass
class HDF5Config(datasets.BuilderConfig):
    batch_size: Optional[int] = None
    features: Optional[datasets.Features] = None


class HDF5(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = HDF5Config

Key Methods:

  • _info(self) -> DatasetInfo -- Returns a DatasetInfo built from the features in the config. The features field is None when no schema was provided, in which case features are inferred at load time.
  • _split_generators(self, dl_manager) -> list[SplitGenerator] -- Downloads data files, infers features from the first HDF5 file if not provided, and creates split generators for each split.
  • _generate_tables(self, files) -> Iterator[tuple[Key, pa.Table]] -- Opens each HDF5 file, reads data in batches (using batch_size from config, writer batch size, or the full row count), and yields Arrow tables cast to the inferred features.
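The batch-size fallback described for _generate_tables can be sketched as follows. The function names and the "first value that is not None" rule are assumptions for illustration; the real builder may treat falsy values such as 0 differently.

```python
def resolve_batch_size(config_batch_size, writer_batch_size, num_rows):
    """Pick the config value, then the writer value, then the full row count."""
    for candidate in (config_batch_size, writer_batch_size):
        if candidate is not None:
            return candidate
    return num_rows


def batch_bounds(num_rows, batch_size):
    """Yield (start, end) row ranges covering num_rows in batch_size chunks."""
    if num_rows == 0:
        return
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, num_rows, batch_size):
        yield start, min(start + batch_size, num_rows)


size = resolve_batch_size(1000, None, 2500)
bounds = list(batch_bounds(2500, size))
```

With 2500 rows and a batch size of 1000, the last batch is shorter than the others; each (start, end) pair would correspond to one slice read from the file.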

Import

# Not imported directly; used via load_dataset
from datasets import load_dataset

ds = load_dataset("hdf5", data_files="path/to/file.h5")

I/O Contract

Inputs

Parameter | Type | Description
data_files | str or list[str] or dict[str, str] | Path(s) to HDF5 files. Supports .h5 and .hdf5 extensions.
batch_size | Optional[int] (default: None) | Number of rows to read per batch. Defaults to writer batch size or total number of rows if unset.
features | Optional[datasets.Features] (default: None) | Explicit feature schema. If None, features are inferred by recursively inspecting the HDF5 file structure.
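Recursive feature inference, as mentioned for the features parameter, can be pictured with a toy model in which nested dicts stand in for HDF5 groups and dtype strings stand in for datasets. The function infer_schema and this dict-based model are purely illustrative assumptions; the actual builder inspects h5py objects and produces datasets.Features.

```python
def infer_schema(node):
    """Walk a nested dict (group) and describe each leaf (dataset).

    Hypothetical model: a dict is a group, a string is a dataset dtype.
    """
    if isinstance(node, dict):
        # Groups become nested mappings, one entry per child.
        return {name: infer_schema(child) for name, child in node.items()}
    # Leaves are described by a feature-like string for readability.
    return f"Value({node!r})"


file_layout = {"meta": {"id": "int64"}, "temperature": "float32"}
schema = infer_schema(file_layout)
```

The nesting of the result mirrors the nesting of the groups, which is the property the recursive traversal is meant to preserve.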

Supported HDF5 data types:

HDF5 Type | Arrow/HF Mapping
Numeric scalars | Value (int, float types)
1D arrays | List(Value) with fixed length
2D-5D arrays | Array2D through Array5D
Complex types (complex64, complex128) | {"real": ..., "imag": ...} struct
Compound dtypes | Nested struct features
Variable-length strings/arrays | Value("string") or List(Value)
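The numeric rows of the mapping above depend only on the rank of each row's data. The sketch below expresses that rule with strings instead of real feature objects; feature_kind is a hypothetical helper, and it assumes row_shape is the dataset shape without the leading row axis.

```python
def feature_kind(row_shape):
    """Name the feature type for one row of numeric data with this shape."""
    rank = len(row_shape)
    if rank == 0:
        return "Value"
    if rank == 1:
        return f"List(Value, length={row_shape[0]})"
    if 2 <= rank <= 5:
        return f"Array{rank}D(shape={tuple(row_shape)})"
    raise ValueError(f"Unsupported rank {rank}: only up to 5D is covered")


kinds = [feature_kind(s) for s in [(), (3,), (128, 256)]]
```

Ranks above five fall outside the table, so the sketch raises rather than guessing a mapping.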

Outputs

Output | Type | Description
Yielded tables | pa.Table | Arrow tables with features cast to the inferred or provided feature schema, keyed by (file_idx, batch_idx).
Final dataset | Dataset | A standard HuggingFace Dataset backed by Arrow storage.
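The (file_idx, batch_idx) keys listed above can be enumerated with two nested loops. This sketch assumes a fixed batch size and a known row count per file; table_keys is a hypothetical name, and the real builder yields a table alongside each key.

```python
def table_keys(rows_per_file, batch_size):
    """Yield one (file_idx, batch_idx) key per batch, file by file."""
    for file_idx, num_rows in enumerate(rows_per_file):
        # batch_idx restarts at 0 for every file.
        for batch_idx, _start in enumerate(range(0, num_rows, batch_size)):
            yield (file_idx, batch_idx)


keys = list(table_keys([2500, 800], batch_size=1000))
```

Pairing the file index with a per-file batch index keeps every key unique across a split even when several files are read.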

Usage Examples

Basic HDF5 loading

from datasets import load_dataset

# Load a single HDF5 file
ds = load_dataset("hdf5", data_files="experiments/results.h5")
print(ds["train"].features)
print(ds["train"][0])

Loading with explicit splits and batch size

from datasets import load_dataset

# Load multiple HDF5 files with a custom batch size
ds = load_dataset(
    "hdf5",
    data_files={
        "train": "data/train_*.hdf5",
        "test": "data/test_*.hdf5",
    },
    batch_size=1000,
)
print(ds["train"].num_rows)

Loading with explicit features

from datasets import Features, Value, Array2D, load_dataset

features = Features({
    "signal": Array2D(shape=(128, 256), dtype="float32"),
    "label": Value("int64"),
})

ds = load_dataset("hdf5", data_files="data/signals.h5", features=features)

Related Pages

Principles

  • HDF5 Dataset Building -- Principle for loading and converting HDF5 hierarchical data into Arrow-backed datasets.

Environments

Related Implementations

  • ArrowBasedBuilder -- The base class that provides the Arrow table writing and caching infrastructure.
  • ParquetBuilder -- Another Arrow-based builder for Parquet files, following a similar pattern.
