
Implementation: Google DeepMind dm_control Environment Reset/Step

From Leeroopedia
Metadata        Value
Implementation  Environment Reset Step
Domain          Reinforcement_Learning, Physics_Simulation
Source          dm_control
Workflow        Control_Suite_RL_Training
Last Updated    2026-02-15 00:00 GMT

Overview

Concrete tool for executing the agent-environment interaction loop through the reset() and step(action) methods of the dm_control Environment class.

Description

The Environment class in dm_control.rl.control implements the dm_env.Environment interface with two core methods:

reset() starts a new episode:

  1. Sets _reset_next_step to False and _step_count to 0.
  2. Enters a physics.reset_context() and calls task.initialize_episode(physics) to randomise the initial state.
  3. Returns a dm_env.TimeStep with step_type=FIRST, reward=None, and discount=None.
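The returned TimeStep can be sketched with a minimal stand-in for dm_env's namedtuple (simplified, not the real library; the actual dm_env.TimeStep also exposes first()/mid()/last() predicates used in the examples below):

```python
from enum import Enum
from typing import Any, NamedTuple

class StepType(Enum):
    FIRST = 0
    MID = 1
    LAST = 2

class TimeStep(NamedTuple):
    """Simplified sketch of dm_env.TimeStep."""
    step_type: StepType
    reward: Any
    discount: Any
    observation: Any

    def first(self) -> bool:
        return self.step_type is StepType.FIRST

    def last(self) -> bool:
        return self.step_type is StepType.LAST

# What reset() returns: FIRST step type, reward and discount are both None.
restart = TimeStep(StepType.FIRST, None, None, {'position': [0.0]})
```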

step(action) advances the simulation by one control step:

  1. If _reset_next_step is True (because the previous step ended the episode), it calls reset() instead -- this is the auto-reset mechanism.
  2. Calls task.before_step(action, physics) to apply the action (typically by setting actuator controls).
  3. Calls physics.step(n_sub_steps) to advance the physics by n_sub_steps simulation timesteps.
  4. Calls task.after_step(physics) for any post-step bookkeeping.
  5. Computes the reward via task.get_reward(physics).
  6. Obtains the observation via task.get_observation(physics).
  7. Determines whether the episode is over:
    • If _step_count >= _step_limit, the time limit has been reached and discount=1.0 (truncation).
    • Otherwise, task.get_termination(physics) is called; if it returns a non-None discount, the task has terminated.
  8. Returns a TimeStep with StepType.LAST and the terminal discount if the episode ended, or StepType.MID and discount=1.0 if it continues.
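The control flow above can be sketched as a toy environment (a hypothetical stand-in, not the real dm_control class; physics, task callbacks, and observations are faked to isolate the auto-reset and episode-end logic):

```python
from collections import namedtuple
from enum import Enum

class StepType(Enum):
    FIRST = 0
    MID = 1
    LAST = 2

TimeStep = namedtuple('TimeStep', ['step_type', 'reward', 'discount', 'observation'])

class ToyEnv:
    """Hypothetical sketch of dm_control.rl.control.Environment,
    reproducing only the control flow of steps 1-8 above."""

    def __init__(self, step_limit=5, terminate_at=None):
        self._step_limit = step_limit
        self._terminate_at = terminate_at  # fake step index at which the "task" terminates
        self._reset_next_step = True
        self._step_count = 0

    def reset(self):
        self._reset_next_step = False
        self._step_count = 0
        # The real class randomises state inside physics.reset_context() here.
        return TimeStep(StepType.FIRST, None, None, {'x': 0.0})

    def step(self, action):
        if self._reset_next_step:  # step 1: auto-reset after a terminal step
            return self.reset()
        # Steps 2-4 (before_step, physics.step, after_step) are elided.
        self._step_count += 1
        reward = 1.0                          # step 5: stands in for task.get_reward
        obs = {'x': float(self._step_count)}  # step 6: stands in for task.get_observation
        # Step 7: episode-over checks -- time limit first, then task termination.
        if self._step_count >= self._step_limit:
            self._reset_next_step = True
            return TimeStep(StepType.LAST, reward, 1.0, obs)  # truncation: discount=1.0
        if self._terminate_at is not None and self._step_count >= self._terminate_at:
            self._reset_next_step = True
            return TimeStep(StepType.LAST, reward, 0.0, obs)  # termination: discount=0.0
        return TimeStep(StepType.MID, reward, 1.0, obs)       # step 8: continuing
```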

Usage

Use this implementation when:

  • You are running a training loop that collects transitions from a dm_control environment.
  • You need to distinguish between time-limit truncation (discount=1.0) and task termination (discount=0.0).
  • You want to rely on auto-reset behaviour without manually checking for terminal states.
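The truncation/termination distinction matters most when bootstrapping value targets: under dm_env's convention, multiplying the bootstrap term by the environment's discount means a truncated episode (discount=1.0) still bootstraps from the next state's value while a terminated one (discount=0.0) does not. A minimal sketch (gamma and the value estimates below are illustrative numbers, not from the source):

```python
def td_target(reward, discount, gamma, next_value):
    # dm_env convention: the environment's discount multiplies the
    # bootstrap term, so termination (discount=0.0) zeroes it while
    # time-limit truncation (discount=1.0) keeps it.
    return reward + gamma * discount * next_value

# Truncated episode: still bootstrap from the next state's value.
truncated_target = td_target(1.0, 1.0, 0.99, 10.0)   # 1.0 + 0.99 * 10.0

# Terminated episode: no bootstrap.
terminated_target = td_target(1.0, 0.0, 0.99, 10.0)  # 1.0
```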

Code Reference

Attribute Detail
Source Location dm_control/rl/control.py:L82-127
Signatures Environment.reset() -> dm_env.TimeStep, Environment.step(action) -> dm_env.TimeStep
Import from dm_control.rl.control import Environment (typically obtained via suite.load)

I/O Contract

Inputs -- reset()

Name Type Description
self Environment The environment instance. No additional arguments.

Inputs -- step(action)

Name Type Description
action numpy array Action conforming to env.action_spec(). Shape and dtype must match.

Outputs -- reset()

Field Type Value
step_type dm_env.StepType StepType.FIRST
reward None No reward on reset.
discount None No discount on reset.
observation OrderedDict[str, ndarray] Initial observation from the task.

Outputs -- step(action)

Field Type Value
step_type dm_env.StepType StepType.MID (continuing) or StepType.LAST (terminal).
reward float Scalar reward from task.get_reward().
discount float 1.0 (continuing or time-limit), or task-defined (typically 0.0 for failure).
observation OrderedDict[str, ndarray] Current observation from the task.

Usage Examples

Basic episode loop:

from dm_control import suite
import numpy as np

env = suite.load('cartpole', 'swingup')
action_spec = env.action_spec()

time_step = env.reset()
episode_return = 0.0

while not time_step.last():
    action = np.random.uniform(
        action_spec.minimum, action_spec.maximum, size=action_spec.shape
    )
    time_step = env.step(action)
    episode_return += time_step.reward

print(f"Episode return: {episode_return}")

Auto-reset behaviour (calling step after terminal):

from dm_control import suite
import numpy as np

env = suite.load('hopper', 'hop')
action_spec = env.action_spec()

time_step = env.reset()
for _ in range(10000):
    action = np.random.uniform(
        action_spec.minimum, action_spec.maximum, size=action_spec.shape
    )
    time_step = env.step(action)
    # No need to check time_step.last() and manually reset;
    # the next call to step() will auto-reset if the episode ended.

Distinguishing truncation from termination:

from dm_control import suite
import numpy as np

env = suite.load('walker', 'walk')
action_spec = env.action_spec()

time_step = env.reset()
while not time_step.last():
    action = np.random.uniform(
        action_spec.minimum, action_spec.maximum, size=action_spec.shape
    )
    time_step = env.step(action)

if time_step.discount == 1.0:
    print("Episode ended by time limit (truncation)")
elif time_step.discount == 0.0:
    print("Episode ended by task termination")
# Note: built-in Control Suite tasks typically run to the time limit;
# early termination with discount=0.0 is more common in custom tasks.
