
Implementation: Google DeepMind dm_control Environment Reset/Step

From Leeroopedia
Metadata        Value
Implementation  Environment Reset Step
Domain          Reinforcement_Learning, Physics_Simulation
Source          dm_control
Workflow        Control_Suite_RL_Training
Last Updated    2026-02-15 00:00 GMT

Overview

Concrete tool for executing the agent-environment interaction loop through the reset() and step(action) methods of the dm_control Environment class.

Description

The Environment class in dm_control.rl.control implements the dm_env.Environment interface with two core methods:

reset() starts a new episode:

  1. Sets _reset_next_step to False and _step_count to 0.
  2. Enters a physics.reset_context() and calls task.initialize_episode(physics) to randomise the initial state.
  3. Returns a dm_env.TimeStep with step_type=FIRST, reward=None, and discount=None.
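The returned TimeStep can be sketched with a minimal stand-in for dm_env's namedtuple (simplified, not the real library; the actual dm_env.TimeStep also exposes first()/mid()/last() predicates used in the examples below):

```python
from enum import Enum
from typing import Any, NamedTuple

class StepType(Enum):
    FIRST = 0
    MID = 1
    LAST = 2

class TimeStep(NamedTuple):
    """Simplified sketch of dm_env.TimeStep."""
    step_type: StepType
    reward: Any
    discount: Any
    observation: Any

    def first(self) -> bool:
        return self.step_type is StepType.FIRST

    def last(self) -> bool:
        return self.step_type is StepType.LAST

# What reset() returns: FIRST step type, reward and discount are both None.
restart = TimeStep(StepType.FIRST, None, None, {'position': [0.0]})
```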

step(action) advances the simulation by one control step:

  1. If _reset_next_step is True (because the previous step ended the episode), it calls reset() instead -- this is the auto-reset mechanism.
  2. Calls task.before_step(action, physics) to apply the action (typically by setting actuator controls).
  3. Calls physics.step(n_sub_steps) to advance the physics by n_sub_steps simulation timesteps.
  4. Calls task.after_step(physics) for any post-step bookkeeping.
  5. Computes the reward via task.get_reward(physics).
  6. Obtains the observation via task.get_observation(physics).
  7. Determines whether the episode is over:
    • If _step_count >= _step_limit, the time limit has been reached and discount=1.0 (truncation).
    • Otherwise, task.get_termination(physics) is called; if it returns a non-None discount, the task has terminated.
  8. Returns a TimeStep with StepType.LAST and the terminal discount if the episode ended, or StepType.MID and discount=1.0 if it continues.
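The control flow above can be sketched as a toy environment (a hypothetical stand-in, not the real dm_control class; physics, task callbacks, and observations are faked to isolate the auto-reset and episode-end logic):

```python
from collections import namedtuple
from enum import Enum

class StepType(Enum):
    FIRST = 0
    MID = 1
    LAST = 2

TimeStep = namedtuple('TimeStep', ['step_type', 'reward', 'discount', 'observation'])

class ToyEnv:
    """Hypothetical sketch of dm_control.rl.control.Environment,
    reproducing only the control flow of steps 1-8 above."""

    def __init__(self, step_limit=5, terminate_at=None):
        self._step_limit = step_limit
        self._terminate_at = terminate_at  # fake step index at which the "task" terminates
        self._reset_next_step = True
        self._step_count = 0

    def reset(self):
        self._reset_next_step = False
        self._step_count = 0
        # The real class randomises state inside physics.reset_context() here.
        return TimeStep(StepType.FIRST, None, None, {'x': 0.0})

    def step(self, action):
        if self._reset_next_step:  # step 1: auto-reset after a terminal step
            return self.reset()
        # Steps 2-4 (before_step, physics.step, after_step) are elided.
        self._step_count += 1
        reward = 1.0                          # step 5: stands in for task.get_reward
        obs = {'x': float(self._step_count)}  # step 6: stands in for task.get_observation
        # Step 7: episode-over checks -- time limit first, then task termination.
        if self._step_count >= self._step_limit:
            self._reset_next_step = True
            return TimeStep(StepType.LAST, reward, 1.0, obs)  # truncation: discount=1.0
        if self._terminate_at is not None and self._step_count >= self._terminate_at:
            self._reset_next_step = True
            return TimeStep(StepType.LAST, reward, 0.0, obs)  # termination: discount=0.0
        return TimeStep(StepType.MID, reward, 1.0, obs)       # step 8: continuing
```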

Usage

Use this implementation when:

  • You are running a training loop that collects transitions from a dm_control environment.
  • You need to distinguish between time-limit truncation (discount=1.0) and task termination (discount=0.0).
  • You want to rely on auto-reset behaviour without manually checking for terminal states.
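The truncation/termination distinction matters most when bootstrapping value targets: under dm_env's convention, multiplying the bootstrap term by the environment's discount means a truncated episode (discount=1.0) still bootstraps from the next state's value while a terminated one (discount=0.0) does not. A minimal sketch (gamma and the value estimates below are illustrative numbers, not from the source):

```python
def td_target(reward, discount, gamma, next_value):
    # dm_env convention: the environment's discount multiplies the
    # bootstrap term, so termination (discount=0.0) zeroes it while
    # time-limit truncation (discount=1.0) keeps it.
    return reward + gamma * discount * next_value

# Truncated episode: still bootstrap from the next state's value.
truncated_target = td_target(1.0, 1.0, 0.99, 10.0)   # 1.0 + 0.99 * 10.0

# Terminated episode: no bootstrap.
terminated_target = td_target(1.0, 0.0, 0.99, 10.0)  # 1.0
```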

Code Reference

Attribute Detail
Source Location dm_control/rl/control.py:L82-127
Signatures Environment.reset() -> dm_env.TimeStep, Environment.step(action) -> dm_env.TimeStep
Import from dm_control.rl.control import Environment (typically obtained via suite.load)

I/O Contract

Inputs -- reset()

Name Type Description
self Environment The environment instance. No additional arguments.

Inputs -- step(action)

Name Type Description
action numpy array Action conforming to env.action_spec(). Shape and dtype must match.

Outputs -- reset()

Field Type Value
step_type dm_env.StepType StepType.FIRST
reward None No reward on reset.
discount None No discount on reset.
observation OrderedDict[str, ndarray] Initial observation from the task.

Outputs -- step(action)

Field Type Value
step_type dm_env.StepType StepType.MID (continuing) or StepType.LAST (terminal).
reward float Scalar reward from task.get_reward().
discount float 1.0 (continuing or time-limit), or task-defined (typically 0.0 for failure).
observation OrderedDict[str, ndarray] Current observation from the task.

Usage Examples

Basic episode loop:

from dm_control import suite
import numpy as np

env = suite.load('cartpole', 'swingup')
action_spec = env.action_spec()

time_step = env.reset()
episode_return = 0.0

while not time_step.last():
    action = np.random.uniform(
        action_spec.minimum, action_spec.maximum, size=action_spec.shape
    )
    time_step = env.step(action)
    episode_return += time_step.reward

print(f"Episode return: {episode_return}")

Auto-reset behaviour (calling step after terminal):

from dm_control import suite
import numpy as np

env = suite.load('hopper', 'hop')
action_spec = env.action_spec()

time_step = env.reset()
for _ in range(10000):
    action = np.random.uniform(
        action_spec.minimum, action_spec.maximum, size=action_spec.shape
    )
    time_step = env.step(action)
    # No need to check time_step.last() and manually reset;
    # the next call to step() will auto-reset if the episode ended.

Distinguishing truncation from termination:

from dm_control import suite
import numpy as np

env = suite.load('walker', 'walk')
action_spec = env.action_spec()

time_step = env.reset()
while not time_step.last():
    action = np.random.uniform(
        action_spec.minimum, action_spec.maximum, size=action_spec.shape
    )
    time_step = env.step(action)

if time_step.discount == 1.0:
    print("Episode ended by time limit (truncation)")
elif time_step.discount == 0.0:
    print("Episode ended by task termination")
# Note: built-in Control Suite tasks typically run to the time limit;
# early termination with discount=0.0 is more common in custom tasks.
