Principle: Google DeepMind dm_control RL Episode Loop
| Metadata | Value |
|---|---|
| Principle | RL Episode Loop |
| Domain | Reinforcement_Learning, Physics_Simulation |
| Source | dm_control |
| Workflow | Control_Suite_RL_Training |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
The RL episode loop is the fundamental interaction cycle in which an agent repeatedly observes the environment state, selects an action, and receives a reward and next observation until the episode terminates.
Description
Every reinforcement learning algorithm, regardless of its update rule, relies on a common outer loop:
- Reset the environment to obtain the initial observation (a FIRST time-step with no reward or discount).
- Step the environment with an action chosen by the agent, receiving a MID time-step that includes an observation, a scalar reward, and a discount factor of 1.0.
- Repeat stepping until the environment signals episode termination by returning a LAST time-step, which carries a final reward and a terminal discount.
- Return to step 1 for the next episode.
The dm_env interface formalises this loop through the StepType enum (FIRST, MID, LAST) and the TimeStep namedtuple (step_type, reward, discount, observation). Two kinds of episode termination are distinguished:
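To make the interface concrete, here is a minimal sketch of the two types in plain Python. This is a stdlib re-creation mirroring dm_env's `StepType` and `TimeStep` (the real library also provides helper constructors such as `restart` and `transition`); it is for illustration, not a substitute for `import dm_env`.

```python
import enum
from typing import Any, NamedTuple, Optional

class StepType(enum.IntEnum):
    """Mirrors dm_env.StepType: the position of a time-step in an episode."""
    FIRST = 0
    MID = 1
    LAST = 2

class TimeStep(NamedTuple):
    """Mirrors the four fields of dm_env.TimeStep."""
    step_type: StepType
    reward: Optional[float]
    discount: Optional[float]
    observation: Any

    def first(self) -> bool:
        return self.step_type == StepType.FIRST

    def last(self) -> bool:
        return self.step_type == StepType.LAST

# A reset produces a FIRST time-step with no reward or discount:
initial = TimeStep(StepType.FIRST, reward=None, discount=None, observation=0.0)
# A mid-episode transition carries a scalar reward and discount 1.0:
transition = TimeStep(StepType.MID, reward=0.5, discount=1.0, observation=1.0)
```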
- Time limit -- the episode has run for the maximum allowed duration. The discount is 1.0, signalling that the value of the terminal state should not be zeroed out (the episode was truncated, not truly terminal).
- Task termination -- the task's get_termination method returns a discount (typically 0.0) indicating a true terminal state (e.g. the agent has fallen).
The environment also implements an auto-reset convenience: if step() is called after a LAST time-step has been returned, it silently calls reset() instead.
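The auto-reset behaviour can be sketched with a toy environment (the `CountdownEnv` class below is hypothetical, not part of dm_control): once a LAST time-step has been emitted, the next call to step() returns a fresh FIRST time-step instead of advancing the finished episode.

```python
class CountdownEnv:
    """Toy dm_env-style environment whose episodes last `horizon` steps.

    Time-steps are plain tuples (step_type, reward, discount, observation)
    to keep the sketch self-contained.
    """
    def __init__(self, horizon=3):
        self._horizon = horizon
        self._t = None  # None means "needs reset"

    def reset(self):
        self._t = 0
        return ('FIRST', None, None, self._t)

    def step(self, action):
        # Auto-reset: stepping after LAST (or before any reset) resets instead.
        if self._t is None or self._t >= self._horizon:
            return self.reset()
        self._t += 1
        if self._t == self._horizon:
            return ('LAST', 1.0, 0.0, self._t)
        return ('MID', 1.0, 1.0, self._t)

env = CountdownEnv(horizon=2)
env.reset()
env.step(0)              # MID time-step
last = env.step(0)       # LAST time-step: episode over
restarted = env.step(0)  # auto-reset: a FIRST time-step, not an error
```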
Usage
Apply this principle whenever:
- You are writing a training or evaluation loop for any RL algorithm.
- You need to understand the semantics of discount values for bootstrapping in value-based or actor-critic methods.
- You want to handle both truncation and true termination correctly in your loss computation.
Theoretical Basis
The canonical episode loop in pseudocode:
    function run_episode(env, agent):
        time_step = env.reset()                    // StepType.FIRST, reward=None
        episode_return = 0
        while time_step.step_type != LAST:
            action = agent.select_action(time_step.observation)
            time_step = env.step(action)           // StepType.MID or LAST
            episode_return += time_step.reward
        return episode_return
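The pseudocode above translates directly into runnable Python. The `FixedHorizonEnv` and `ConstantAgent` classes below are hypothetical stand-ins for a dm_control environment and a learned policy, kept minimal so the loop itself is the focus.

```python
from collections import namedtuple

TimeStep = namedtuple('TimeStep', ['step_type', 'reward', 'discount', 'observation'])

class FixedHorizonEnv:
    """Toy environment: emits `horizon` transitions of reward 1.0, then LAST."""
    def __init__(self, horizon=5):
        self._horizon = horizon
        self._t = 0

    def reset(self):
        self._t = 0
        return TimeStep('FIRST', None, None, self._t)

    def step(self, action):
        self._t += 1
        step_type = 'LAST' if self._t == self._horizon else 'MID'
        discount = 0.0 if step_type == 'LAST' else 1.0
        return TimeStep(step_type, 1.0, discount, self._t)

class ConstantAgent:
    """Trivial agent standing in for a learned policy."""
    def select_action(self, observation):
        return 0.0

def run_episode(env, agent):
    time_step = env.reset()            # FIRST time-step, reward is None
    episode_return = 0.0
    while time_step.step_type != 'LAST':
        action = agent.select_action(time_step.observation)
        time_step = env.step(action)   # MID or LAST
        episode_return += time_step.reward
    return episode_return

total = run_episode(FixedHorizonEnv(horizon=5), ConstantAgent())  # 5 rewards of 1.0
```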
Within env.step(action), the internal sequence is:
    task.before_step(action, physics)           // apply action to actuators
    physics.step(n_sub_steps)                   // advance simulation
    task.after_step(physics)                    // optional post-step hook
    reward = task.get_reward(physics)
    observation = task.get_observation(physics)
    termination = task.get_termination(physics)
The discount returned at time limit is 1.0 (truncation), while the discount returned by get_termination is task-defined (usually 0.0 for failure).
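These discount semantics matter when computing bootstrapped targets. A minimal sketch of a one-step TD target shows why: multiplying the bootstrap term by the environment's discount handles truncation and true termination with the same formula (the function and values below are illustrative, not from dm_control).

```python
def td_target(reward, discount, gamma, next_value):
    """One-step TD target: reward + discount * gamma * V(s').

    The environment-supplied discount keeps the bootstrap term at a
    time limit (discount 1.0) and zeroes it at a true terminal state
    (discount 0.0), so no special-casing is needed in the loss.
    """
    return reward + discount * gamma * next_value

# Truncation (time limit): discount 1.0 keeps the bootstrap term.
truncated = td_target(reward=1.0, discount=1.0, gamma=0.99, next_value=10.0)
# True termination (e.g. the agent has fallen): discount 0.0 removes it.
terminal = td_target(reward=1.0, discount=0.0, gamma=0.99, next_value=10.0)
```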