Principle:ARISE Initiative Robosuite HDF5 Dataset Aggregation
Metadata:
- robosuite
- Imitation_Learning
- Data_Engineering
- last_updated: 2026-02-15 12:00 GMT
Overview
Process for aggregating raw demonstration episode directories into a single structured HDF5 dataset file for efficient storage and access.
Description
After demonstrations are collected as .npz files of per-timestep states and actions, the raw data is aggregated into a single HDF5 file that provides a standardized format with fast random access. The file groups demonstrations under data/demo_N/. Each group holds a states dataset (flattened MuJoCo states) and an actions dataset, plus a model_file attribute containing the environment's XML. Metadata records the collection date and time, the repository version, and the environment configuration.
The aggregation process involves:
- Reading each raw demonstration directory and the .npz files of recorded states and actions it contains
- Extracting states (flattened MuJoCo simulator states) and actions from .npz files
- Organizing data hierarchically with each demonstration as a separate group
- Storing environment model XML files as attributes for reproducibility
- Embedding metadata about collection conditions and software versions
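The steps above can be sketched with h5py and NumPy. This is a minimal illustration, not robosuite's own script. The function name aggregate_demos, the raw layout (one subdirectory per episode holding .npz files and a model.xml), and the .npz keys "states" and "actions" are assumptions made for the example; a real collector may name or nest these differently. The metadata attributes are attached to the data group here, which is also an assumption.

```python
import datetime
import json
import os

import h5py
import numpy as np


def aggregate_demos(raw_dir, out_path, env_info, repository_version="unknown"):
    """Aggregate episode directories under raw_dir into one HDF5 file.

    Assumed (hypothetical) layout: raw_dir/<episode>/*.npz plus model.xml,
    where every .npz file holds "states" and "actions" arrays.
    Returns the number of demonstrations written.
    """
    with h5py.File(out_path, "w") as f:
        data = f.create_group("data")
        num_demos = 0
        for ep_name in sorted(os.listdir(raw_dir)):
            ep_dir = os.path.join(raw_dir, ep_name)
            if not os.path.isdir(ep_dir):
                continue

            # Assumes file names sort chronologically (zero-padded or timestamped).
            npz_names = sorted(n for n in os.listdir(ep_dir) if n.endswith(".npz"))
            states, actions = [], []
            for name in npz_names:
                with np.load(os.path.join(ep_dir, name)) as chunk:
                    states.append(chunk["states"])
                    actions.append(chunk["actions"])
            if not states:
                continue  # skip episode directories without recorded data

            # One group per demonstration: data/demo_0, data/demo_1, ...
            demo = data.create_group(f"demo_{num_demos}")
            demo.create_dataset("states", data=np.concatenate(states, axis=0))
            demo.create_dataset("actions", data=np.concatenate(actions, axis=0))

            # Keep the environment model XML alongside the trajectory.
            with open(os.path.join(ep_dir, "model.xml")) as xml_file:
                demo.attrs["model_file"] = xml_file.read()
            num_demos += 1

        # Provenance metadata stored as attributes.
        now = datetime.datetime.now()
        data.attrs["date"] = now.strftime("%Y-%m-%d")
        data.attrs["time"] = now.strftime("%H:%M:%S")
        data.attrs["repository_version"] = repository_version
        data.attrs["env_info"] = json.dumps(env_info)
    return num_demos
```

Demonstrations are renumbered from zero in directory order, so skipped (empty) episodes leave no gaps in the demo_N names.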
This standardized format enables:
- Compact storage in a single file, with optional HDF5 compression filters
- Fast random access to individual demonstrations or timesteps
- Portability across different systems and frameworks
- Reproducibility by preserving environment configurations
Usage
Use after collecting demonstrations to create a portable dataset file suitable for training imitation learning algorithms. The resulting HDF5 file serves as the standard input format for training policies through behavioral cloning, offline reinforcement learning, or other methods that learn from demonstrations.
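As a rough illustration of consuming the file, the sketch below gathers every demonstration's states and actions into two arrays that a supervised learner could train on. The function name load_state_action_pairs is hypothetical, and training directly on flattened simulator states is an assumption made for brevity; many pipelines instead replay the states in the simulator to produce observations first. It also assumes each demonstration has as many states as actions.

```python
import h5py
import numpy as np


def load_state_action_pairs(hdf5_path):
    """Concatenate the states and actions of all demonstrations."""
    states, actions = [], []
    with h5py.File(hdf5_path, "r") as f:
        data = f["data"]
        # Sort demo_N names numerically so demo_10 comes after demo_9.
        demo_keys = sorted(data.keys(), key=lambda k: int(k.split("_")[-1]))
        for key in demo_keys:
            demo = data[key]
            states.append(demo["states"][()])    # [()] reads the full dataset
            actions.append(demo["actions"][()])
    return np.concatenate(states, axis=0), np.concatenate(actions, axis=0)
```

The two returned arrays share their first dimension, so row i of the states array pairs with row i of the actions array.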
Theoretical Basis
HDF5 (Hierarchical Data Format version 5) supports chunked, optionally compressed storage with random access to parts of a dataset, which makes it well suited to large datasets. Its hierarchical group structure maps naturally onto demonstration data, where each episode is a logical unit.
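To make the chunking and random-access point concrete, here is a small sketch using h5py. The helper names write_compressed_states and read_timesteps and the chunk size of 64 rows are illustrative choices, and gzip is only one of the available filters; nothing here implies that a given collection script enables compression.

```python
import h5py
import numpy as np


def write_compressed_states(path, states, rows_per_chunk=64):
    """Store an NxD state array as a chunked, gzip-compressed dataset."""
    # A chunk may not be larger than the dataset along any dimension.
    chunk_shape = (min(rows_per_chunk, states.shape[0]), states.shape[1])
    with h5py.File(path, "w") as f:
        # Intermediate groups (data, demo_0) are created automatically.
        f.create_dataset(
            "data/demo_0/states",
            data=states,
            chunks=chunk_shape,
            compression="gzip",
            compression_opts=4,
        )


def read_timesteps(path, start, stop):
    """Read only rows [start, stop); h5py fetches just the chunks it needs."""
    with h5py.File(path, "r") as f:
        return f["data/demo_0/states"][start:stop]
```

Slicing the dataset object, rather than loading it whole, is what keeps reads of a few timesteps cheap on large files.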
Data Schema:
The HDF5 file follows this hierarchical structure, where N is the number of timesteps in a demonstration, D the length of a flattened state, and A the action dimension:
demo.hdf5
├── data/
│   ├── demo_0/
│   │   ├── states (dataset: NxD array of MuJoCo states)
│   │   ├── actions (dataset: NxA array of actions)
│   │   └── model_file (attribute: XML string)
│   ├── demo_1/
│   │   ├── states
│   │   ├── actions
│   │   └── model_file
│   └── ...
└── metadata (attributes; robosuite's collection script sets these on the data/ group)
    ├── date (collection date)
    ├── time (collection time)
    ├── repository_version
    └── env_info (JSON-encoded configuration)
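A file laid out this way can be inspected with a few lines of h5py. The sketch below is illustrative only: summarize_dataset is a hypothetical helper, and it assumes the metadata attributes are attached to the data group and that env_info, when present, is a JSON string.

```python
import json

import h5py


def summarize_dataset(hdf5_path):
    """Return metadata plus per-demonstration array shapes."""
    summary = {"metadata": {}, "demos": {}}
    with h5py.File(hdf5_path, "r") as f:
        data = f["data"]

        # Attributes behave like a mapping of name -> value.
        metadata = dict(data.attrs)
        if "env_info" in metadata:
            metadata["env_info"] = json.loads(metadata["env_info"])
        summary["metadata"] = metadata

        for name, demo in data.items():
            summary["demos"][name] = {
                "states": demo["states"].shape,    # (N, D)
                "actions": demo["actions"].shape,  # (N, A)
                "model_xml_chars": len(demo.attrs["model_file"]),
            }
    return summary
```

Reading .shape and .attrs does not load the arrays themselves, so the summary stays cheap even for large files.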
Key Design Decisions:
- Flattened states: Each timestep's MuJoCo simulator state is flattened into a 1D vector of fixed length D, so a demonstration's states stack into a single NxD array
- Per-demonstration groups: Each episode is self-contained, allowing independent access
- XML preservation: Storing each demonstration's model XML as an attribute allows the environment to be rebuilt as it was at collection time
- Metadata embedding: Provenance information enables dataset versioning and debugging
Benefits:
- Compression: HDF5 datasets can be chunked and compressed (for example with gzip), which can reduce file size relative to uncompressed raw .npz files; the gain depends on the data
- Random access: Individual demonstrations can be loaded without reading the entire file
- Language-agnostic: HDF5 libraries exist for Python, C++, MATLAB, and other languages
- Scalability: Supports datasets from megabytes to terabytes