Workflow:ARISE Initiative Robomimic Dataset Preparation Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Robot_Learning, Simulation |
| Last Updated | 2026-02-15 07:30 GMT |
Overview
End-to-end process for acquiring, transforming, and preparing robot demonstration datasets in HDF5 format for use with robomimic training pipelines.
Description
This workflow covers the full data preparation pipeline from raw demonstration data to training-ready HDF5 datasets. It begins with downloading benchmark datasets from Hugging Face (simulation) or Stanford servers (real-world), then processes raw simulation states into observation-rich datasets by replaying trajectories through the simulator to extract low-dimensional state, RGB images, depth maps, and camera intrinsics/extrinsics. The pipeline includes creating train/validation splits via HDF5 filter keys and optionally filtering datasets by demonstration count for ablation studies. The resulting datasets are directly consumable by robomimic's SequenceDataset loader.
Usage
Execute this workflow when setting up a new robomimic experiment that requires preparing demonstration data for training. It covers: downloading the official robomimic benchmark datasets (lift, can, square, transport, tool_hang), converting raw simulation state recordings into observation datasets with image or low-dimensional modalities, preparing custom robosuite demonstration data for training, and creating data splits for rigorous train/validation evaluation.
Execution Steps
Step 1: Dataset Download
Acquire raw demonstration datasets from the robomimic dataset registry. Simulation datasets are hosted on Hugging Face and downloaded via the huggingface_hub API, while real-world robot datasets are hosted at Stanford and downloaded via URL. The download script supports filtering by task (lift, can, square, transport, tool_hang, plus real variants), dataset type (proficient-human, multi-human, machine-generated, paired), and HDF5 format (raw, low_dim, image, sparse/dense reward variants).
Key considerations:
- Default download location is the datasets/ directory adjacent to the robomimic package
- A dry-run mode previews which datasets would be downloaded without actual transfer
- Raw datasets contain simulation states but not rendered observations
- Pre-processed low_dim and image datasets are available for convenience but can also be generated locally
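A downloaded raw dataset can be inspected with h5py before extraction. The sketch below builds a tiny mock file that mimics the raw layout (one `demo_N` group per demonstration under `data/`, holding `states` and `actions` but no rendered observations); the file name, dimensions, and `env_args` content are illustrative, not the real registry's values.

```python
"""Sketch: what a raw robomimic-style dataset looks like on disk (mocked)."""
import h5py
import numpy as np

path = "raw_demo.hdf5"
with h5py.File(path, "w") as f:
    data = f.create_group("data")
    data.attrs["env_args"] = '{"env_name": "Lift"}'  # serialized env metadata
    for i in range(2):
        demo = data.create_group(f"demo_{i}")
        demo.create_dataset("states", data=np.zeros((10, 45)))  # sim states
        demo.create_dataset("actions", data=np.zeros((10, 7)))  # robot actions
        demo.attrs["num_samples"] = 10

with h5py.File(path, "r") as f:
    print(sorted(f["data"].keys()))       # ['demo_0', 'demo_1']
    print("obs" in f["data/demo_0"])      # False: no observations until Step 2
```

Checking for an `obs` group is a quick way to tell a raw file from an already-extracted one.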
Step 2: Observation Extraction
Transform raw HDF5 datasets containing simulation states into observation-rich datasets by replaying each trajectory through the simulator. For each demonstration, the environment is reset to each recorded state to extract observations (proprioception, object state, images, depth). Rewards and done signals are re-inferred from the simulator. Camera intrinsics and extrinsics are captured for robosuite environments. The output is a new HDF5 file with full observation data alongside actions, states, rewards, and dones.
Key considerations:
- Low-dimensional extraction omits camera names; image extraction requires specifying camera names and resolution
- A multiprocess variant is available for faster extraction on multi-core machines
- Compression (gzip) and next-obs exclusion options reduce file size for image datasets
- Done mode controls how the terminal signal is written: task success (0), trajectory end (1), or both (2)
- Existing filter keys from the source file are automatically copied to the output
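The replay loop can be sketched as follows. The real pipeline is robomimic's extraction script; here a stub environment with a hypothetical `reset_to_state` interface stands in for the simulator so the logic is runnable. Done mode 2 (both) is shown: done at trajectory end, also set on task success.

```python
"""Conceptual sketch of observation extraction via state replay (stub env)."""
import h5py
import numpy as np

class StubEnv:
    """Stand-in for a robosuite environment (hypothetical interface)."""
    def reset_to_state(self, state):
        # A real env would restore the simulator, then render/compute obs.
        return {"robot0_eef_pos": state[:3], "object": state[3:6]}
    def get_reward(self):
        return 0.0
    def is_success(self):
        return False

def extract_obs(src_path, dst_path, env, done_mode=2):
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        out = dst.create_group("data")
        for name in src["data"]:
            demo = src[f"data/{name}"]
            states = demo["states"][()]
            T = states.shape[0]
            # Replay: reset to each recorded state, collect observations.
            obs = [env.reset_to_state(s) for s in states]
            rewards = np.array([env.get_reward() for _ in range(T)])
            dones = np.zeros(T)
            if done_mode in (1, 2):
                dones[-1] = 1.0  # done at trajectory end
            if done_mode in (0, 2):
                dones[-1] = max(dones[-1], float(env.is_success()))
            g = out.create_group(name)
            g.create_dataset("actions", data=demo["actions"][()])
            g.create_dataset("rewards", data=rewards)
            g.create_dataset("dones", data=dones)
            for k in obs[0]:
                g.create_dataset(f"obs/{k}", data=np.stack([o[k] for o in obs]))
            g.attrs["num_samples"] = T

# Build a tiny raw source file, then extract observations from it.
with h5py.File("raw.hdf5", "w") as f:
    d = f.create_group("data/demo_0")
    d.create_dataset("states", data=np.random.randn(5, 6))
    d.create_dataset("actions", data=np.zeros((5, 7)))
extract_obs("raw.hdf5", "obs.hdf5", StubEnv())
```

The output file now carries an `obs/` group per demo alongside actions, rewards, and dones, which is the shape the training loader expects.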
Step 3: Train/Validation Split
Create train/validation partitions by writing filter keys into the HDF5 file's mask group. The split randomly assigns demonstration trajectories to training and validation sets at a configurable ratio (default 10% validation). Filter keys are stored as arrays of demonstration names under mask/train and mask/valid, allowing the training pipeline to load the appropriate subset without duplicating data.
Key considerations:
- Splitting is done in-place by adding filter keys to the existing HDF5 file
- A fixed random seed ensures reproducible splits across runs
- An optional input filter key allows splitting a subset of demonstrations rather than the full dataset
- The training pipeline requires both train and valid filter keys when validation is enabled
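The split itself is a small amount of h5py bookkeeping, sketched below against a mock dataset; it mirrors what robomimic's split script does, but the function name and defaults here are illustrative. Demo names are shuffled with a fixed seed and written as byte-string arrays under `mask/train` and `mask/valid`.

```python
"""Sketch: writing train/valid filter keys into a dataset's mask group."""
import h5py
import numpy as np

def split_train_val(path, val_ratio=0.1, seed=0):
    rng = np.random.RandomState(seed)  # fixed seed -> reproducible split
    with h5py.File(path, "a") as f:   # in-place: only filter keys are added
        demos = sorted(f["data"].keys())
        rng.shuffle(demos)
        n_val = max(1, int(len(demos) * val_ratio))
        valid, train = demos[:n_val], demos[n_val:]
        mask = f.require_group("mask")
        for key, names in (("train", train), ("valid", valid)):
            if key in mask:
                del mask[key]  # overwrite a stale split if present
            mask.create_dataset(key, data=np.array(names, dtype="S"))

# Mock dataset with 20 demos, then split 90/10.
with h5py.File("demo.hdf5", "w") as f:
    for i in range(20):
        f.create_group(f"data/demo_{i}")
split_train_val("demo.hdf5")
with h5py.File("demo.hdf5", "r") as f:
    print(len(f["mask/train"]), len(f["mask/valid"]))  # 18 2
```

Because only name arrays are stored, no trajectory data is duplicated and the same file serves both subsets.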
Step 4: Dataset Size Filtering
Create additional filter keys that restrict training to a specified number of demonstrations, enabling controlled ablation studies over dataset size. This step randomly selects N demonstrations from the full set (or from a specified input filter key) and writes a new filter key. Multiple sizes can be processed in a single invocation.
Key considerations:
- Filter key names default to "{N}_demos" format but can be customized
- When combined with an input filter key, the output name is prefixed with the input key name
- This is primarily used for benchmark experiments studying the effect of dataset scale
- The fixed random seed ensures consistent demo selection across experiments
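The bullets above can be sketched in a few lines of h5py; this is illustrative only (robomimic ships its own tooling for this), with a hypothetical function name. For each requested size N, a seeded random subset of demo names is written under `mask/{N}_demos`, prefixed with the input filter key's name when one is given.

```python
"""Sketch: filter keys that cap the number of training demonstrations."""
import h5py
import numpy as np

def filter_dataset_size(path, sizes, input_filter_key=None, seed=0):
    rng = np.random.RandomState(seed)  # consistent selection across runs
    with h5py.File(path, "a") as f:
        if input_filter_key is not None:
            demos = sorted(n.decode() for n in f[f"mask/{input_filter_key}"][()])
            prefix = f"{input_filter_key}_"  # output name carries the input key
        else:
            demos = sorted(f["data"].keys())
            prefix = ""
        mask = f.require_group("mask")
        for n in sizes:  # multiple sizes handled in one invocation
            chosen = rng.permutation(demos)[:n]
            key = f"{prefix}{n}_demos"
            if key in mask:
                del mask[key]
            mask.create_dataset(key, data=np.array(sorted(chosen), dtype="S"))

# Mock file with 50 demos; create 10- and 25-demo subsets in one call.
with h5py.File("ds.hdf5", "w") as f:
    for i in range(50):
        f.create_group(f"data/demo_{i}")
filter_dataset_size("ds.hdf5", sizes=[10, 25])
with h5py.File("ds.hdf5", "r") as f:
    print(sorted(f["mask"].keys()))  # ['10_demos', '25_demos']
```

Training runs can then point at `10_demos` or `25_demos` to sweep dataset scale without touching the underlying trajectories.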