Principle: Google DeepMind dm_control RL Episode Loop
| Metadata | Value |
|---|---|
| Principle | RL Episode Loop |
| Domain | Reinforcement_Learning, Physics_Simulation |
| Source | dm_control |
| Workflow | Control_Suite_RL_Training |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
The RL episode loop is the fundamental interaction cycle in which an agent repeatedly observes the environment state, selects an action, and receives a reward and next observation until the episode terminates.
Description
Every reinforcement learning algorithm, regardless of its update rule, relies on a common outer loop:
- Reset the environment to obtain the initial observation (a FIRST time-step with no reward or discount).
- Step the environment with an action chosen by the agent, receiving a MID time-step that includes an observation, a scalar reward, and a discount factor of 1.0.
- Repeat stepping until the environment signals episode termination by returning a LAST time-step, which carries a final reward and a terminal discount.
- Return to step 1 for the next episode.
The dm_env interface formalises this loop through the StepType enum (FIRST, MID, LAST) and the TimeStep namedtuple (step_type, reward, discount, observation). Two kinds of episode termination are distinguished:
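To make the interface concrete, here is a minimal sketch of the two types in plain Python. This is a stdlib re-creation mirroring dm_env's `StepType` and `TimeStep` (the real library also provides helper constructors such as `restart` and `transition`); it is for illustration, not a substitute for `import dm_env`.

```python
import enum
from typing import Any, NamedTuple, Optional

class StepType(enum.IntEnum):
    """Mirrors dm_env.StepType: the position of a time-step in an episode."""
    FIRST = 0
    MID = 1
    LAST = 2

class TimeStep(NamedTuple):
    """Mirrors the four fields of dm_env.TimeStep."""
    step_type: StepType
    reward: Optional[float]
    discount: Optional[float]
    observation: Any

    def first(self) -> bool:
        return self.step_type == StepType.FIRST

    def last(self) -> bool:
        return self.step_type == StepType.LAST

# A reset produces a FIRST time-step with no reward or discount:
initial = TimeStep(StepType.FIRST, reward=None, discount=None, observation=0.0)
# A mid-episode transition carries a scalar reward and discount 1.0:
transition = TimeStep(StepType.MID, reward=0.5, discount=1.0, observation=1.0)
```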
- Time limit -- the episode has run for the maximum allowed duration. The discount is 1.0, signalling that the value of the terminal state should not be zeroed out (the episode was truncated, not truly terminal).
- Task termination -- the task's get_termination method returns a discount (typically 0.0) indicating a true terminal state (e.g. the agent has fallen).
The environment also implements an auto-reset convenience: if step() is called after a LAST time-step has been returned, it silently calls reset() instead.
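The auto-reset behaviour can be sketched with a toy environment (the `CountdownEnv` class below is hypothetical, not part of dm_control): once a LAST time-step has been emitted, the next call to step() returns a fresh FIRST time-step instead of advancing the finished episode.

```python
class CountdownEnv:
    """Toy dm_env-style environment whose episodes last `horizon` steps.

    Time-steps are plain tuples (step_type, reward, discount, observation)
    to keep the sketch self-contained.
    """
    def __init__(self, horizon=3):
        self._horizon = horizon
        self._t = None  # None means "needs reset"

    def reset(self):
        self._t = 0
        return ('FIRST', None, None, self._t)

    def step(self, action):
        # Auto-reset: stepping after LAST (or before any reset) resets instead.
        if self._t is None or self._t >= self._horizon:
            return self.reset()
        self._t += 1
        if self._t == self._horizon:
            return ('LAST', 1.0, 0.0, self._t)
        return ('MID', 1.0, 1.0, self._t)

env = CountdownEnv(horizon=2)
env.reset()
env.step(0)              # MID time-step
last = env.step(0)       # LAST time-step: episode over
restarted = env.step(0)  # auto-reset: a FIRST time-step, not an error
```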
Usage
Apply this principle whenever:
- You are writing a training or evaluation loop for any RL algorithm.
- You need to understand the semantics of discount values for bootstrapping in value-based or actor-critic methods.
- You want to handle both truncation and true termination correctly in your loss computation.
Theoretical Basis
The canonical episode loop in pseudocode:
    function run_episode(env, agent):
        time_step = env.reset()                    // StepType.FIRST, reward=None
        episode_return = 0
        while time_step.step_type != LAST:
            action = agent.select_action(time_step.observation)
            time_step = env.step(action)           // StepType.MID or LAST
            episode_return += time_step.reward
        return episode_return
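The pseudocode above translates directly into runnable Python. The `FixedHorizonEnv` and `ConstantAgent` classes below are hypothetical stand-ins for a dm_control environment and a learned policy, kept minimal so the loop itself is the focus.

```python
from collections import namedtuple

TimeStep = namedtuple('TimeStep', ['step_type', 'reward', 'discount', 'observation'])

class FixedHorizonEnv:
    """Toy environment: emits `horizon` transitions of reward 1.0, then LAST."""
    def __init__(self, horizon=5):
        self._horizon = horizon
        self._t = 0

    def reset(self):
        self._t = 0
        return TimeStep('FIRST', None, None, self._t)

    def step(self, action):
        self._t += 1
        step_type = 'LAST' if self._t == self._horizon else 'MID'
        discount = 0.0 if step_type == 'LAST' else 1.0
        return TimeStep(step_type, 1.0, discount, self._t)

class ConstantAgent:
    """Trivial agent standing in for a learned policy."""
    def select_action(self, observation):
        return 0.0

def run_episode(env, agent):
    time_step = env.reset()            # FIRST time-step, reward is None
    episode_return = 0.0
    while time_step.step_type != 'LAST':
        action = agent.select_action(time_step.observation)
        time_step = env.step(action)   # MID or LAST
        episode_return += time_step.reward
    return episode_return

total = run_episode(FixedHorizonEnv(horizon=5), ConstantAgent())  # 5 rewards of 1.0
```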
Within env.step(action), the internal sequence is:
    task.before_step(action, physics)           // apply action to actuators
    physics.step(n_sub_steps)                   // advance simulation
    task.after_step(physics)                    // optional post-step hook
    reward = task.get_reward(physics)
    observation = task.get_observation(physics)
    termination = task.get_termination(physics)
The discount returned at time limit is 1.0 (truncation), while the discount returned by get_termination is task-defined (usually 0.0 for failure).
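These discount semantics matter when computing bootstrapped targets. A minimal sketch of a one-step TD target shows why: multiplying the bootstrap term by the environment's discount handles truncation and true termination with the same formula (the function and values below are illustrative, not from dm_control).

```python
def td_target(reward, discount, gamma, next_value):
    """One-step TD target: reward + discount * gamma * V(s').

    The environment-supplied discount keeps the bootstrap term at a
    time limit (discount 1.0) and zeroes it at a true terminal state
    (discount 0.0), so no special-casing is needed in the loss.
    """
    return reward + discount * gamma * next_value

# Truncation (time limit): discount 1.0 keeps the bootstrap term.
truncated = td_target(reward=1.0, discount=1.0, gamma=0.99, next_value=10.0)
# True termination (e.g. the agent has fallen): discount 0.0 removes it.
terminal = td_target(reward=1.0, discount=0.0, gamma=0.99, next_value=10.0)
```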