Implementation: ARISE Initiative Robosuite Gather Demonstrations As HDF5
Metadata:
- robosuite
- Imitation_Learning
- Data_Engineering
- last_updated: 2026-02-15 12:00 GMT
Overview
Reference for gather_demonstrations_as_hdf5(), a concrete function in robosuite's collection scripts that aggregates raw demonstration directories into a single HDF5 file.
Description
The gather_demonstrations_as_hdf5() function reads raw episode directories (each containing state_*.npz files and a model.xml) and writes a single demo.hdf5 file. Each demonstration is stored under data/demo_N/ with states and actions datasets plus a model_file attribute.
The function performs the following operations:
- Scans the input directory for demonstration subdirectories
- Reads per-timestep state and action data from .npz files
- Aggregates timesteps into demonstration-level arrays
- Creates an HDF5 file with hierarchical group structure
- Stores states and actions as datasets within each demo group
- Embeds model.xml content as an attribute for environment reproducibility
- Adds metadata attributes including collection date, time, and environment configuration
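The steps above can be sketched as a simplified, self-contained aggregator. This is a hypothetical helper (gather_demos_sketch), not the robosuite implementation referenced below: it assumes each .npz file holds plain states and actions arrays, and it omits details such as the repository_version attribute.

```python
import datetime
import os

import h5py
import numpy as np


def gather_demos_sketch(directory, out_dir, env_info):
    """Simplified sketch: aggregate raw episode dirs into demo.hdf5.

    Assumes each subdirectory of `directory` holds a model.xml and one or
    more state_*.npz files containing 'states' and 'actions' arrays.
    """
    out_path = os.path.join(out_dir, "demo.hdf5")
    with h5py.File(out_path, "w") as f:
        data_grp = f.create_group("data")
        num_eps = 0
        # Scan the input directory for demonstration subdirectories
        for ep_dir in sorted(os.listdir(directory)):
            ep_path = os.path.join(directory, ep_dir)
            if not os.path.isdir(ep_path):
                continue
            # Read per-timestep state and action data from .npz files
            states, actions = [], []
            for fname in sorted(os.listdir(ep_path)):
                if fname.startswith("state_") and fname.endswith(".npz"):
                    dic = np.load(os.path.join(ep_path, fname))
                    states.extend(dic["states"])
                    actions.extend(dic["actions"])
            if len(states) == 0:
                continue  # skip empty episodes
            # Store states and actions as datasets within each demo group
            ep_grp = data_grp.create_group(f"demo_{num_eps}")
            ep_grp.create_dataset("states", data=np.array(states))
            ep_grp.create_dataset("actions", data=np.array(actions))
            # Embed model.xml content for environment reproducibility
            with open(os.path.join(ep_path, "model.xml")) as xml_f:
                ep_grp.attrs["model_file"] = xml_f.read()
            num_eps += 1
        # Root-level metadata (repository_version omitted in this sketch)
        now = datetime.datetime.now()
        f.attrs["date"] = now.strftime("%Y-%m-%d")
        f.attrs["time"] = now.strftime("%H:%M:%S")
        f.attrs["env"] = env_info
    return out_path
```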
Usage
Called as the final step of the demonstration collection pipeline, after human teleoperation or scripted data collection has completed, to aggregate all collected episodes into the final dataset file.
Code Reference
Source: robosuite
File: robosuite/scripts/collect_human_demonstrations.py
Lines: L120-207
Signature:
def gather_demonstrations_as_hdf5(directory, out_dir, env_info):
    """
    Gathers demonstrations saved in @directory into a single hdf5 file.

    Args:
        directory (str): Path to raw demonstration directories
        out_dir (str): Path to store the hdf5 file
        env_info (str): JSON-encoded environment information string
    """
Import:
from robosuite.scripts.collect_human_demonstrations import gather_demonstrations_as_hdf5
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| directory | str | Yes | Path to raw demonstration directories containing state_*.npz files and model.xml |
| out_dir | str | Yes | Path to directory where the output HDF5 file will be stored |
| env_info | str | Yes | JSON-encoded string containing environment configuration (robot, controller, task parameters) |
Outputs
File: demo.hdf5 with the following structure:
demo.hdf5
├── data/
│ ├── demo_0/
│ │ ├── states (dataset: float array, shape [N, D])
│ │ ├── actions (dataset: float array, shape [N, A])
│ │ └── model_file (attribute: string, XML content)
│ ├── demo_1/
│ │ └── ...
│ └── demo_K/
│ └── ...
└── (root attributes)
├── date (string: collection date)
├── time (string: collection time)
├── repository_version (string: git commit hash)
└── env (string: JSON environment configuration)
Dataset Shapes:
- states: [num_timesteps, state_dim] - Flattened MuJoCo simulator states
- actions: [num_timesteps, action_dim] - Robot control actions
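Any file with the layout above can be inspected generically with h5py's visititems. Here describe_hdf5 is an illustrative helper, not part of robosuite:

```python
import h5py


def describe_hdf5(path):
    """Return one line per group/dataset in an HDF5 file.

    Datasets are listed with their shape and dtype; groups end in '/'.
    """
    lines = []

    def visitor(name, obj):
        if isinstance(obj, h5py.Dataset):
            lines.append(f"{name}  shape={obj.shape}  dtype={obj.dtype}")
        else:
            lines.append(f"{name}/")

    with h5py.File(path, "r") as f:
        # visititems walks every group and dataset below the root
        f.visititems(visitor)
    return lines
```

Printing the returned lines gives a quick text rendering of the tree shown above, which is handy when debugging a freshly aggregated dataset.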
Usage Examples
Example 1: Basic Aggregation
import json

from robosuite.scripts.collect_human_demonstrations import gather_demonstrations_as_hdf5

# Define paths
raw_demo_dir = "/tmp/raw_demonstrations"
output_dir = "/tmp/datasets"

# Environment configuration
env_config = {
    "env_name": "Lift",
    "robots": "Panda",
    "controller": "OSC_POSE",
    "horizon": 500,
}

# Convert to JSON string
env_info_json = json.dumps(env_config)

# Aggregate demonstrations
gather_demonstrations_as_hdf5(
    directory=raw_demo_dir,
    out_dir=output_dir,
    env_info=env_info_json,
)

print(f"Dataset created at {output_dir}/demo.hdf5")
Example 2: Reading Back the HDF5 File
import h5py

# Open the aggregated dataset
with h5py.File("/tmp/datasets/demo.hdf5", "r") as f:
    # Read metadata
    print("Collection date:", f.attrs["date"])
    print("Repository version:", f.attrs["repository_version"])
    print("Environment config:", f.attrs["env"])

    # Access first demonstration
    demo_0 = f["data/demo_0"]

    # Load states and actions into memory
    states = demo_0["states"][:]    # shape: [N, state_dim]
    actions = demo_0["actions"][:]  # shape: [N, action_dim]

    # Read environment model XML
    model_xml = demo_0.attrs["model_file"]

    print(f"Demo 0: {len(states)} timesteps")
    print(f"State dimension: {states.shape[1]}")
    print(f"Action dimension: {actions.shape[1]}")

    # Iterate through all demonstrations
    num_demos = len([k for k in f["data"].keys() if k.startswith("demo")])
    print(f"Total demonstrations: {num_demos}")
    for i in range(num_demos):
        demo = f[f"data/demo_{i}"]
        print(f"Demo {i}: {len(demo['states'])} timesteps")
Example 3: Integration with Data Collection Pipeline
import json

import robosuite as suite
from robosuite.scripts.collect_human_demonstrations import gather_demonstrations_as_hdf5

# Step 1: Collect demonstrations (simplified example)
env = suite.make(
    "Lift",
    robots="Panda",
    has_renderer=True,
    has_offscreen_renderer=False,
    use_camera_obs=False,
)

# ... collect demonstrations and save to raw directory ...
# (demonstration collection code omitted for brevity)

# Step 2: Aggregate after collection completes
raw_dir = "/tmp/demos/raw"
output_dir = "/tmp/demos/processed"

# Gather environment metadata
env_info = {
    "env_name": "Lift",
    "type": 1,  # environment type
    "env_kwargs": {
        "robots": "Panda",
        "controller_configs": {"type": "OSC_POSE"},
    },
}

# Aggregate into HDF5
gather_demonstrations_as_hdf5(
    directory=raw_dir,
    out_dir=output_dir,
    env_info=json.dumps(env_info),
)

print(f"Dataset ready for training at {output_dir}/demo.hdf5")
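As an optional post-processing sanity check (a hypothetical helper, not part of robosuite), the embedded metadata can be validated by parsing the env attribute as JSON and each demo's model_file attribute as XML, matching the I/O contract above:

```python
import json
import xml.etree.ElementTree as ET

import h5py


def validate_demo_file(path):
    """Check that demo.hdf5 metadata is well-formed.

    Raises if the root 'env' attribute is not valid JSON or any
    demo's 'model_file' attribute is not valid XML; returns the
    number of demos that passed.
    """
    with h5py.File(path, "r") as f:
        json.loads(f.attrs["env"])  # raises ValueError if malformed
        ok = 0
        for name in f["data"]:
            xml_str = f[f"data/{name}"].attrs["model_file"]
            ET.fromstring(xml_str)  # raises ParseError if malformed
            ok += 1
    return ok
```

Running this right after aggregation catches truncated model.xml files or a mangled environment configuration before the dataset is handed to a training pipeline.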