Principle: ARISE Initiative Robomimic Dataset Loading
| Knowledge Sources | |
|---|---|
| Domains | Robotics, Data_Pipeline, Offline_Learning |
| Last Updated | 2026-02-15 08:00 GMT |
Overview
A dataset construction principle that loads offline robot demonstration data from HDF5 files into memory-efficient PyTorch Dataset objects with support for temporal sequences, filter keys, and train/validation splitting.
Description
Dataset Loading is a critical principle in offline robot learning (learning from demonstrations). Unlike online reinforcement learning where data is collected during training, offline methods require loading pre-collected demonstration trajectories. The challenge is handling large HDF5 datasets containing multi-modal observations (images, proprioception, actions) across potentially hundreds of demonstrations, each with variable length.
This principle addresses several key challenges:
- Memory management: Datasets can be too large to fit in RAM, requiring on-demand loading with optional caching
- Temporal structure: Robot learning algorithms often need sequences of consecutive frames (frame stacking, history windows), not individual transitions
- Flexible subsetting: Filter keys allow selecting subsets of demonstrations (e.g., only training demos, or only the first N demos) without creating separate files
- Multi-dataset training: Multiple HDF5 files can be combined for joint training across tasks
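The temporal-structure challenge above can be sketched with a small helper that extracts a fixed-length window from a demonstration, padding with the last frame when the window runs past the end of the demo. This is an illustrative sketch, not robomimic's implementation; the function name `get_sequence` and the repeat-last-frame padding policy are assumptions.

```python
import numpy as np

def get_sequence(demo, start, seq_length):
    """Slice seq_length consecutive frames from `demo` beginning at
    `start`, repeating the final frame if the demo ends early.

    demo: array of shape (T, ...) holding one demonstration.
    Returns an array of shape (seq_length, ...).
    """
    T = len(demo)
    end = min(start + seq_length, T)
    seq = demo[start:end]
    if len(seq) < seq_length:
        # Pad by repeating the last available frame (one common choice;
        # zero-padding is another).
        pad = np.repeat(seq[-1:], seq_length - len(seq), axis=0)
        seq = np.concatenate([seq, pad], axis=0)
    return seq
```

A history window for frame stacking can be built the same way by slicing backward from the current frame instead of forward.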
Usage
Use this principle after configuration setup and observation initialization. It is required before algorithm instantiation because the dataset provides shape metadata (observation dimensions, action space size) needed to construct neural networks.
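The shape-metadata handoff can be illustrated by inspecting a single sample before building networks. The sample layout and observation names below (`agentview_image`, `robot_joint_pos`) are hypothetical stand-ins for whatever modalities the dataset actually contains.

```python
import numpy as np

# Hypothetical sample, as a dataset might return it:
# per-modality observation arrays of shape (seq_length, *obs_shape)
# plus an action array of shape (seq_length, action_dim).
sample = {
    "obs": {
        "agentview_image": np.zeros((10, 3, 84, 84)),  # (seq, C, H, W)
        "robot_joint_pos": np.zeros((10, 7)),          # (seq, dim)
    },
    "actions": np.zeros((10, 7)),
}

# Shape metadata needed to size network inputs/outputs.
obs_shapes = {k: v.shape[1:] for k, v in sample["obs"].items()}
action_dim = sample["actions"].shape[-1]
```

The algorithm can then construct encoders from `obs_shapes` and a policy head of width `action_dim` before training begins.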
Theoretical Basis
The dataset loading principle follows an HDF5 filter key pattern for flexible data subsetting:
```python
# Abstract pattern (not real implementation)
#
# HDF5 file structure:
# data/
#   demo_0/   (obs, actions, rewards, dones)
#   demo_1/
#   ...
# mask/
#   train:    [demo_0, demo_2, demo_5, ...]  # filter key
#   valid:    [demo_1, demo_3, ...]          # filter key
#   20_demos: [demo_0, ..., demo_19]         # filter key

# Loading uses a filter key to select demos
demo_keys = hdf5["mask/train"][:]  # only the training demos
dataset = SequenceDataset(hdf5_path, demo_keys, seq_length=10)
```
The SequenceDataset internally indexes all valid subsequences across demonstrations, building a flat index that maps each dataset index to a (demo_id, start_frame) pair. This makes the data compatible with standard PyTorch DataLoader batching.
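The flat-index construction can be sketched as follows. This is a minimal sketch under the assumption that only full-length windows are indexed; the real implementation also handles padding, frame stacking, and per-key caching.

```python
def build_index(demo_lengths, seq_length):
    """Build a flat index mapping dataset indices to (demo_id, start_frame).

    demo_lengths: dict of demo_id -> number of frames in that demo.
    Each demo of length T contributes T - seq_length + 1 valid windows.
    """
    index = []
    for demo_id, T in demo_lengths.items():
        for start in range(T - seq_length + 1):
            index.append((demo_id, start))
    return index
```

With this index, `__len__` is just `len(index)` and `__getitem__(i)` looks up `index[i]`, then slices `seq_length` frames from the corresponding demo, which is exactly what a PyTorch DataLoader needs for shuffled batching.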