Principle: Google DeepMind dm_control Manipulation Episode Loop
| Metadata | |
|---|---|
| Knowledge Sources | dm_control |
| Domains | Reinforcement Learning, Robotics Simulation, Episode Management |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
The manipulation episode loop is the principle of running a reinforcement learning episode as a sequence of reset-step cycles, where each step applies an action, advances the physics simulation through multiple sub-steps, computes a reward, and returns a standardised timestep that signals whether the episode continues or has terminated.
Description
An RL episode in a manipulation environment follows the dm_env protocol:
1. Reset -- initialise the physics state, randomise the scene (robot pose, object positions), and return a `FIRST` timestep containing the initial observation.
2. Step -- accept an action from the agent, apply it to the actuators, advance the physics simulation, compute the reward and discount, and return a `MID` or `LAST` timestep.
3. Repeat step 2 until the episode terminates (time limit exceeded, task-specific termination condition, or physics divergence).
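Viewed through the dm_env protocol, this reset/step cycle can be sketched with a minimal, self-contained stand-in. The `ToyEnv` class and its fixed three-step episode are illustrative assumptions, not the dm_control implementation:

```python
import enum
from dataclasses import dataclass
from typing import Any, Optional


class StepType(enum.Enum):
    FIRST = 0  # first timestep of an episode (from reset)
    MID = 1    # ordinary intermediate timestep
    LAST = 2   # final timestep of an episode


@dataclass
class TimeStep:
    step_type: StepType
    reward: Optional[float]
    discount: Optional[float]
    observation: Any

    def first(self) -> bool:
        return self.step_type is StepType.FIRST

    def last(self) -> bool:
        return self.step_type is StepType.LAST


class ToyEnv:
    """Illustrative environment that ends after a fixed number of steps."""

    def __init__(self, time_limit_steps: int = 3):
        self._limit = time_limit_steps
        self._t = 0

    def reset(self) -> TimeStep:
        # Re-initialise state; a real task would also randomise the scene.
        self._t = 0
        return TimeStep(StepType.FIRST, reward=None, discount=None, observation=0.0)

    def step(self, action) -> TimeStep:
        self._t += 1
        # Time-limit truncation: LAST step_type but discount stays at 1.
        truncated = self._t >= self._limit
        return TimeStep(StepType.LAST if truncated else StepType.MID,
                        reward=1.0, discount=1.0, observation=float(self._t))


env = ToyEnv()
timestep = env.reset()
total_reward = 0.0
while not timestep.last():
    timestep = env.step(action=None)  # a real agent would pick an action here
    total_reward += timestep.reward
```

Note the discount convention: a `LAST` timestep with discount 1 signals truncation (time limit), whereas true task termination carries discount 0.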
The step operation is not a single physics integration. It comprises:
- Before-step hooks -- the task and its entities can modify the physics state or apply pre-processing before integration begins.
- Sub-steps -- the physics simulator is stepped multiple times per agent action, with before-substep and after-substep hooks called around each integration step. This decouples the agent's control frequency from the physics simulation frequency.
- After-step hooks -- the task and entities can perform post-processing (e.g. updating internal state) after all sub-steps complete.
- Observation update -- the observation buffer is updated from the current physics state.
- Reward and discount computation -- the task's `get_reward()` and `get_discount()` methods are called.
- Termination check -- the task's `should_terminate_episode()` method and the time limit are consulted.
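The phases above can be composed into a runnable sketch. Here `ToyPhysics`, `NullTask`, and `update_observations` are stand-ins for the real physics, task, and observation machinery; the names mirror the hooks described above but are not taken verbatim from the dm_control API:

```python
class ToyPhysics:
    """Stand-in physics that just counts integrations and advances time."""

    def __init__(self, timestep: float = 0.002):
        self.timestep = timestep
        self.time = 0.0
        self.n_integrations = 0

    def step(self):
        self.time += self.timestep
        self.n_integrations += 1


class NullTask:
    """Task with no-op hooks, constant reward, and no early termination."""

    def before_step(self, physics, action): pass
    def before_substep(self, physics, action): pass
    def after_substep(self, physics): pass
    def after_step(self, physics): pass
    def get_reward(self, physics): return 0.0
    def get_discount(self, physics): return 1.0
    def should_terminate_episode(self, physics): return False


def update_observations(physics):
    """Placeholder for refreshing the observation buffer from physics state."""


def environment_step(physics, task, action, n_sub_steps, time_limit):
    """One agent-level step: hooks, sub-steps, observation, reward, termination."""
    task.before_step(physics, action)
    for i in range(n_sub_steps):
        task.before_substep(physics, action)
        physics.step()                    # one physics integration
        task.after_substep(physics)
        if i < n_sub_steps - 1:
            update_observations(physics)  # keep buffered observables fresh
    task.after_step(physics)
    update_observations(physics)
    reward = task.get_reward(physics)
    discount = task.get_discount(physics)
    terminated = task.should_terminate_episode(physics) or physics.time >= time_limit
    return reward, discount, terminated
```

With 20 sub-steps of 0.002 s each, one call advances simulated time by one 0.04 s control timestep.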
The reset operation supports multiple attempts: if an `EpisodeInitializationError` is raised (e.g. due to an infeasible random initialisation), the environment retries up to a configurable number of times before propagating the error.
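A sketch of that retry behaviour; the exception name follows the text, while the `reset_with_retries` wrapper and its default attempt count are illustrative assumptions:

```python
class EpisodeInitializationError(Exception):
    """Raised when a random episode initialisation turns out to be infeasible."""


def reset_with_retries(initialize_episode, max_attempts=3):
    """Call initialize_episode, retrying on EpisodeInitializationError."""
    for attempt in range(max_attempts):
        try:
            return initialize_episode()
        except EpisodeInitializationError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the error to the caller
```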
Usage
The episode loop is executed by any training or evaluation script that interacts with a manipulation environment. The agent calls `reset()` once, then calls `step(action)` repeatedly until `timestep.last()` is True.
Theoretical Basis
The episode loop follows the standard MDP (Markov Decision Process) interaction protocol:
```python
timestep = env.reset()           # s_0, r=None, step_type=FIRST
while not timestep.last():
    action = agent.select_action(timestep.observation)
    timestep = env.step(action)  # s_{t+1}, r_t, step_type=MID or LAST

# Inside env.step(action):
#   1. before_step(action)
#   2. for i in range(n_sub_steps):
#          before_substep(action)
#          physics.step()        # MuJoCo integration
#          after_substep()
#          if i < n_sub_steps - 1:
#              update_observations()
#   3. after_step()
#   4. update_observations()
#   5. reward = task.get_reward(physics)
#   6. discount = task.get_discount(physics)
#   7. terminated = task.should_terminate_episode(physics) or time >= time_limit
#   8. return TimeStep(MID or LAST, reward, discount, observation)
```
The control timestep for manipulation tasks is 0.04 seconds (25 Hz). With the default MuJoCo simulation timestep, this means multiple physics integration sub-steps per agent action, ensuring numerical stability while keeping the agent's decision frequency at a practical rate.
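Assuming the model uses MuJoCo's default integration timestep of 0.002 s (an assumption; model files may override it), the number of sub-steps per agent action follows directly:

```python
control_timestep = 0.04    # seconds per agent action (25 Hz), per the text above
physics_timestep = 0.002   # assumed MuJoCo default; model files may override it

# Number of physics integrations per agent-level step.
n_sub_steps = round(control_timestep / physics_timestep)
```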
If the physics diverges during a step, the episode is terminated with reward 0 and discount 0, preventing corrupted state from propagating to the agent.
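One way to sketch that guard; the `PhysicsError` name and the `guarded_step` wrapper are illustrative stand-ins, not dm_control's exact structure:

```python
class PhysicsError(Exception):
    """Stand-in for the simulator's divergence error."""


def guarded_step(physics_step, make_timestep):
    """Run one physics step; on divergence, end the episode immediately."""
    try:
        physics_step()
    except PhysicsError:
        # Terminate with reward 0 and discount 0 so the corrupted state
        # contributes nothing to the agent's return.
        return make_timestep(step_type="LAST", reward=0.0, discount=0.0)
    return None  # no divergence: caller proceeds with the normal step pipeline
```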