Principle:ARISE Initiative Robomimic Train Validation Split
| Knowledge Sources | |
|---|---|
| Domains | Robotics, Data_Pipeline, Data_Splitting |
| Last Updated | 2026-02-15 08:00 GMT |
Overview
A demonstration-level data splitting pattern that partitions robot demonstration datasets into training and validation subsets using in-place HDF5 filter keys without duplicating data.
Description
Train Validation Split creates disjoint training and validation subsets from a demonstration dataset. The split operates at the demonstration level rather than on individual transitions: each trajectory is assigned in full to either train or validation. This matters for robot learning because consecutive transitions within a trajectory are highly correlated, so a transition-level split would leak information from the same trajectory into both sets and make validation metrics overly optimistic.
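The leakage argument can be illustrated with a toy example. This is a minimal sketch, not taken from robomimic: the dataset (5 demonstrations of 4 transitions each), the 25% hold-out, and all variable names are assumptions chosen for illustration.

```python
import random

# Hypothetical toy dataset: 5 demonstrations, 4 transitions each.
transitions = [(f"demo_{d}", t) for d in range(5) for t in range(4)]

rng = random.Random(0)

# Transition-level split: shuffle individual transitions, hold out 25%.
shuffled = transitions[:]
rng.shuffle(shuffled)
num_val = len(shuffled) // 4
val_t, train_t = shuffled[:num_val], shuffled[num_val:]
# Demonstrations that contribute transitions to BOTH sets (leakage).
leaked = {d for d, _ in val_t} & {d for d, _ in train_t}

# Demonstration-level split: hold out whole trajectories instead.
demos = sorted({d for d, _ in transitions})
rng.shuffle(demos)
val_demos, train_demos = set(demos[:1]), set(demos[1:])
# No demonstration can appear on both sides.
overlap = val_demos & train_demos
```

With these sizes the transition-level split always leaks: five held-out transitions cannot be made up of whole four-transition demonstrations, so at least one demonstration ends up in both sets, whereas the demonstration-level split is disjoint by construction.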
The splitting mechanism uses HDF5 filter keys stored in the dataset's mask/ group. A filter key is simply a list of demonstration names (e.g., ["demo_0", "demo_3", "demo_5"]) that defines a subset. This avoids creating duplicate HDF5 files and allows multiple overlapping subsets to coexist in the same file.
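As a rough sketch of how such a filter key might be written and read back, the following uses h5py and NumPy. The helper names `write_filter_key` and `read_filter_key`, the byte-string encoding, and the in-memory file are assumptions made for illustration and may differ from what robomimic does internally.

```python
import h5py
import numpy as np


def write_filter_key(f, name, demo_names):
    """Store a list of demonstration names under mask/<name> (hypothetical helper)."""
    mask = f.require_group("mask")
    if name in mask:
        del mask[name]  # overwrite an existing key of the same name
    # Fixed-length byte strings are a simple way to store names in HDF5.
    mask.create_dataset(name, data=np.array(demo_names, dtype="S"))


def read_filter_key(f, name):
    """Return the names stored under mask/<name> as Python strings."""
    return [n.decode("utf-8") for n in f["mask"][name][:]]


# In-memory file so the sketch leaves nothing on disk.
with h5py.File("example.hdf5", "w", driver="core", backing_store=False) as f:
    write_filter_key(f, "train", ["demo_0", "demo_3", "demo_5"])
    write_filter_key(f, "valid", ["demo_1"])
    train_names = read_filter_key(f, "train")
    stored_keys = sorted(f["mask"].keys())
```

Only the lists of names are written; the trajectory data itself is untouched, which is what lets several subsets share one file.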
Usage
Use this principle after observation extraction and before training. It is a prerequisite for validated training (when config.experiment.validate is True). The resulting filter keys ("train" and "valid") are referenced by config.train.hdf5_filter_key and config.train.hdf5_validation_filter_key.
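For orientation, the three settings named above could be arranged as in the fragment below. It is written as a plain Python dict that mirrors the dotted config paths; the dict form and the variable name `config_fragment` are assumptions, and a real config contains many other fields that are omitted here.

```python
# Hypothetical fragment: only the three keys named in the text are shown.
config_fragment = {
    "experiment": {
        "validate": True,  # enable validated training
    },
    "train": {
        "hdf5_filter_key": "train",  # demos used for training
        "hdf5_validation_filter_key": "valid",  # demos used for validation
    },
}
```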
Theoretical Basis
# Abstract splitting pattern (not the real implementation)
import random

demos = [f"demo_{i}" for i in range(100)]  # "demo_0" ... "demo_99"
num_val = int(0.1 * len(demos))  # 10% for validation
# Random assignment: shuffle the demonstration names, then slice
random.shuffle(demos)
val_demos = demos[:num_val]
train_demos = demos[num_val:]
# Store as filter keys in the HDF5 mask/ group (hdf5 is an open, writable file handle)
hdf5["mask/train"] = train_demos
hdf5["mask/valid"] = val_demos
The filter key approach supports nested splitting: a subset (e.g., "20_demos") can itself be split into train/valid, producing "20_demos_train" and "20_demos_valid".
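Nested splitting can be sketched in pure Python, with a dict standing in for the HDF5 mask/ group. The function name `split_filter_key`, the `"all"` key holding every demonstration name, and the fixed seed are assumptions made for this illustration; only the `<key>_train` / `<key>_valid` naming comes from the text above.

```python
import random


def split_filter_key(mask, val_ratio=0.1, filter_key=None, seed=0):
    """Split all demos, or only those under `filter_key`, into train/valid keys.

    `mask` is a plain dict standing in for the HDF5 mask/ group.
    """
    demos = sorted(mask[filter_key] if filter_key else mask["all"])
    rng = random.Random(seed)
    rng.shuffle(demos)
    num_val = int(val_ratio * len(demos))
    # Nested splits are prefixed with the name of the parent subset.
    prefix = f"{filter_key}_" if filter_key else ""
    mask[f"{prefix}valid"] = sorted(demos[:num_val])
    mask[f"{prefix}train"] = sorted(demos[num_val:])
    return mask


mask = {"all": [f"demo_{i}" for i in range(100)]}
mask["20_demos"] = mask["all"][:20]  # an existing subset

split_filter_key(mask)  # writes "train" and "valid"
split_filter_key(mask, filter_key="20_demos")  # writes "20_demos_train" and "20_demos_valid"
```

Because every split is just another named list, the full-dataset split and the subset split coexist without interfering with each other.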