Principle: Google DeepMind dm_control Manipulation Episode Loop
| Metadata | |
|---|---|
| Knowledge Sources | dm_control |
| Domains | Reinforcement Learning, Robotics Simulation, Episode Management |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
The manipulation episode loop is the principle of running a reinforcement learning episode as a sequence of reset-step cycles, where each step applies an action, advances the physics simulation through multiple sub-steps, computes a reward, and returns a standardised timestep that signals whether the episode continues or has terminated.
Description
An RL episode in a manipulation environment follows the dm_env protocol:
1. Reset -- initialise the physics state, randomise the scene (robot pose, object positions), and return a `FIRST` timestep containing the initial observation.
2. Step -- accept an action from the agent, apply it to the actuators, advance the physics simulation, compute the reward and discount, and return a `MID` or `LAST` timestep.
3. Repeat step 2 until the episode terminates (time limit exceeded, task-specific termination condition, or physics divergence).
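Viewed through the dm_env protocol, this reset/step cycle can be sketched with a minimal, self-contained stand-in. The `ToyEnv` class and its fixed three-step episode are illustrative assumptions, not the dm_control implementation:

```python
import enum
from dataclasses import dataclass
from typing import Any, Optional


class StepType(enum.Enum):
    FIRST = 0  # first timestep of an episode (from reset)
    MID = 1    # ordinary intermediate timestep
    LAST = 2   # final timestep of an episode


@dataclass
class TimeStep:
    step_type: StepType
    reward: Optional[float]
    discount: Optional[float]
    observation: Any

    def first(self) -> bool:
        return self.step_type is StepType.FIRST

    def last(self) -> bool:
        return self.step_type is StepType.LAST


class ToyEnv:
    """Illustrative environment that ends after a fixed number of steps."""

    def __init__(self, time_limit_steps: int = 3):
        self._limit = time_limit_steps
        self._t = 0

    def reset(self) -> TimeStep:
        # Re-initialise state; a real task would also randomise the scene.
        self._t = 0
        return TimeStep(StepType.FIRST, reward=None, discount=None, observation=0.0)

    def step(self, action) -> TimeStep:
        self._t += 1
        # Time-limit truncation: LAST step_type but discount stays at 1.
        truncated = self._t >= self._limit
        return TimeStep(StepType.LAST if truncated else StepType.MID,
                        reward=1.0, discount=1.0, observation=float(self._t))


env = ToyEnv()
timestep = env.reset()
total_reward = 0.0
while not timestep.last():
    timestep = env.step(action=None)  # a real agent would pick an action here
    total_reward += timestep.reward
```

Note the discount convention: a `LAST` timestep with discount 1 signals truncation (time limit), whereas true task termination carries discount 0.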
The step operation is not a single physics integration. It comprises:
- Before-step hooks -- the task and its entities can modify the physics state or apply pre-processing before integration begins.
- Sub-steps -- the physics simulator is stepped multiple times per agent action, with before-substep and after-substep hooks called around each integration step. This decouples the agent's control frequency from the physics simulation frequency.
- After-step hooks -- the task and entities can perform post-processing (e.g. updating internal state) after all sub-steps complete.
- Observation update -- the observation buffer is updated from the current physics state.
- Reward and discount computation -- the task's `get_reward()` and `get_discount()` methods are called.
- Termination check -- the task's `should_terminate_episode()` method and the time limit are consulted.
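The phases above can be composed into a runnable sketch. Here `ToyPhysics`, `NullTask`, and `update_observations` are stand-ins for the real physics, task, and observation machinery; the names mirror the hooks described above but are not taken verbatim from the dm_control API:

```python
class ToyPhysics:
    """Stand-in physics that just counts integrations and advances time."""

    def __init__(self, timestep: float = 0.002):
        self.timestep = timestep
        self.time = 0.0
        self.n_integrations = 0

    def step(self):
        self.time += self.timestep
        self.n_integrations += 1


class NullTask:
    """Task with no-op hooks, constant reward, and no early termination."""

    def before_step(self, physics, action): pass
    def before_substep(self, physics, action): pass
    def after_substep(self, physics): pass
    def after_step(self, physics): pass
    def get_reward(self, physics): return 0.0
    def get_discount(self, physics): return 1.0
    def should_terminate_episode(self, physics): return False


def update_observations(physics):
    """Placeholder for refreshing the observation buffer from physics state."""


def environment_step(physics, task, action, n_sub_steps, time_limit):
    """One agent-level step: hooks, sub-steps, observation, reward, termination."""
    task.before_step(physics, action)
    for i in range(n_sub_steps):
        task.before_substep(physics, action)
        physics.step()                    # one physics integration
        task.after_substep(physics)
        if i < n_sub_steps - 1:
            update_observations(physics)  # keep buffered observables fresh
    task.after_step(physics)
    update_observations(physics)
    reward = task.get_reward(physics)
    discount = task.get_discount(physics)
    terminated = task.should_terminate_episode(physics) or physics.time >= time_limit
    return reward, discount, terminated
```

With 20 sub-steps of 0.002 s each, one call advances simulated time by one 0.04 s control timestep.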
The reset operation supports multiple attempts: if an `EpisodeInitializationError` is raised (e.g. due to an infeasible random initialisation), the environment retries up to a configurable number of times before propagating the error.
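A sketch of that retry behaviour; the exception name follows the text, while the `reset_with_retries` wrapper and its default attempt count are illustrative assumptions:

```python
class EpisodeInitializationError(Exception):
    """Raised when a random episode initialisation turns out to be infeasible."""


def reset_with_retries(initialize_episode, max_attempts=3):
    """Call initialize_episode, retrying on EpisodeInitializationError."""
    for attempt in range(max_attempts):
        try:
            return initialize_episode()
        except EpisodeInitializationError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the error to the caller
```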
Usage
The episode loop is executed by any training or evaluation script that interacts with a manipulation environment. The agent calls `reset()` once, then calls `step(action)` repeatedly until `timestep.last()` is True.
Theoretical Basis
The episode loop follows the standard MDP (Markov Decision Process) interaction protocol:
```python
timestep = env.reset()           # s_0, r=None, step_type=FIRST
while not timestep.last():
    action = agent.select_action(timestep.observation)
    timestep = env.step(action)  # s_{t+1}, r_t, step_type=MID or LAST

# Inside env.step(action):
#   1. before_step(action)
#   2. for i in range(n_sub_steps):
#          before_substep(action)
#          physics.step()        # MuJoCo integration
#          after_substep()
#          if i < n_sub_steps - 1:
#              update_observations()
#   3. after_step()
#   4. update_observations()
#   5. reward = task.get_reward(physics)
#   6. discount = task.get_discount(physics)
#   7. terminated = task.should_terminate_episode(physics) or time >= time_limit
#   8. return TimeStep(MID or LAST, reward, discount, observation)
```
The control timestep for manipulation tasks is 0.04 seconds (25 Hz). With the default MuJoCo simulation timestep, this means multiple physics integration sub-steps per agent action, ensuring numerical stability while keeping the agent's decision frequency at a practical rate.
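Assuming the model uses MuJoCo's default integration timestep of 0.002 s (an assumption; model files may override it), the number of sub-steps per agent action follows directly:

```python
control_timestep = 0.04    # seconds per agent action (25 Hz), per the text above
physics_timestep = 0.002   # assumed MuJoCo default; model files may override it

# Number of physics integrations per agent-level step.
n_sub_steps = round(control_timestep / physics_timestep)
```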
If the physics diverges during a step, the episode is terminated with reward 0 and discount 0, preventing corrupted state from propagating to the agent.
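One way to sketch that guard; the `PhysicsError` name and the `guarded_step` wrapper are illustrative stand-ins, not dm_control's exact structure:

```python
class PhysicsError(Exception):
    """Stand-in for the simulator's divergence error."""


def guarded_step(physics_step, make_timestep):
    """Run one physics step; on divergence, end the episode immediately."""
    try:
        physics_step()
    except PhysicsError:
        # Terminate with reward 0 and discount 0 so the corrupted state
        # contributes nothing to the agent's return.
        return make_timestep(step_type="LAST", reward=0.0, discount=0.0)
    return None  # no divergence: caller proceeds with the normal step pipeline
```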