Implementation: Google DeepMind dm_control Environment Reset/Step
| Metadata | Value |
|---|---|
| Implementation | Environment Reset Step |
| Domain | Reinforcement_Learning, Physics_Simulation |
| Source | dm_control |
| Workflow | Control_Suite_RL_Training |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool for executing the agent-environment interaction loop through the reset() and step(action) methods of the dm_control Environment class.
Description
The Environment class in dm_control.rl.control implements the dm_env.Environment interface with two core methods:
reset() starts a new episode:
- Sets `_reset_next_step` to `False` and `_step_count` to 0.
- Enters a `physics.reset_context()` and calls `task.initialize_episode(physics)` to randomise the initial state.
- Returns a `dm_env.TimeStep` with `step_type=FIRST`, `reward=None`, and `discount=None`.
step(action) advances the simulation by one control step:
- If `_reset_next_step` is `True` (because the previous step ended the episode), it calls `reset()` instead -- this is the auto-reset mechanism.
- Calls `task.before_step(action, physics)` to apply the action (typically by setting actuator controls).
- Calls `physics.step(n_sub_steps)` to advance the physics by `n_sub_steps` simulation timesteps.
- Calls `task.after_step(physics)` for any post-step bookkeeping.
- Computes the reward via `task.get_reward(physics)`.
- Obtains the observation via `task.get_observation(physics)`.
- Determines whether the episode is over:
  - If `_step_count >= _step_limit`, the time limit has been reached and `discount=1.0` (truncation).
  - Otherwise, `task.get_termination(physics)` is called; if it returns a non-`None` discount, the task has terminated.
- Returns a `TimeStep` with `StepType.LAST` and the terminal discount if the episode ended, or `StepType.MID` and `discount=1.0` if it continues.
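The control flow above can be condensed into a runnable sketch. `SketchEnv`, the stand-in `TimeStep` tuple, and the stub task/physics interface are illustrative simplifications for this page, not dm_control's actual classes (which live in `dm_control/rl/control.py`):

```python
import collections

# Stand-in for dm_env.TimeStep; field names mirror dm_env's.
TimeStep = collections.namedtuple(
    'TimeStep', ['step_type', 'reward', 'discount', 'observation']
)


class SketchEnv:
    """Illustrative sketch of the reset()/step() bookkeeping described above."""

    def __init__(self, task, physics, step_limit):
        self._task = task
        self._physics = physics
        self._step_limit = step_limit
        self._reset_next_step = True
        self._step_count = 0

    def reset(self):
        self._reset_next_step = False
        self._step_count = 0
        self._task.initialize_episode(self._physics)
        return TimeStep('FIRST', None, None,
                        self._task.get_observation(self._physics))

    def step(self, action):
        if self._reset_next_step:  # auto-reset mechanism
            return self.reset()
        self._task.before_step(action, self._physics)
        self._physics.step()
        self._task.after_step(self._physics)
        reward = self._task.get_reward(self._physics)
        obs = self._task.get_observation(self._physics)
        self._step_count += 1
        if self._step_count >= self._step_limit:  # time-limit truncation
            self._reset_next_step = True
            return TimeStep('LAST', reward, 1.0, obs)
        discount = self._task.get_termination(self._physics)
        if discount is not None:  # task termination (discount typically 0.0)
            self._reset_next_step = True
            return TimeStep('LAST', reward, discount, obs)
        return TimeStep('MID', reward, 1.0, obs)
```

The real `Environment` additionally handles observation flattening, `n_sub_steps`, and spec construction; the sketch keeps only the episode bookkeeping this page describes.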
Usage
Use this implementation when:
- You are running a training loop that collects transitions from a dm_control environment.
- You need to distinguish between time-limit truncation (`discount=1.0`) and task termination (`discount=0.0`).
- You want to rely on auto-reset behaviour without manually checking for terminal states.
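When bridging to Gym-style APIs, the truncation/termination distinction above maps onto `(terminated, truncated)` flags. A minimal sketch (the helper name `split_done` and the mapping convention are illustrative, not part of dm_control; it relies only on the `last()` and `discount` fields described on this page):

```python
def split_done(time_step):
    """Map a dm_env-style TimeStep to Gym-style (terminated, truncated) flags.

    Convention from this page: on a final step, discount == 0.0 signals
    task termination, while discount == 1.0 signals time-limit truncation.
    """
    if not time_step.last():
        return False, False
    terminated = time_step.discount == 0.0
    return terminated, not terminated
```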
Code Reference
| Attribute | Detail |
|---|---|
| Source Location | `dm_control/rl/control.py:L82-127` |
| Signatures | `Environment.reset() -> dm_env.TimeStep`, `Environment.step(action) -> dm_env.TimeStep` |
| Import | `from dm_control.rl.control import Environment` (typically obtained via `suite.load`) |
I/O Contract
Inputs -- reset()
| Name | Type | Description |
|---|---|---|
| `self` | `Environment` | The environment instance. No additional arguments. |
Inputs -- step(action)
| Name | Type | Description |
|---|---|---|
| `action` | numpy array | Action conforming to `env.action_spec()`. Shape and dtype must match. |
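Out-of-range actions can be clamped to the spec bounds before stepping. A sketch using plain sequences as stand-ins for `action_spec().minimum` / `.maximum` (the helper name `clip_action` is illustrative; with numpy arrays you would use `np.clip` directly):

```python
def clip_action(action, minimum, maximum):
    """Clamp each action component into the spec's [minimum, maximum] range.

    Stand-in for enforcing the action_spec() bounds; minimum/maximum play
    the role of the spec's per-dimension bound arrays.
    """
    return [max(lo, min(hi, a)) for a, lo, hi in zip(action, minimum, maximum)]
```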
Outputs -- reset()
| Field | Type | Value |
|---|---|---|
| `step_type` | `dm_env.StepType` | `StepType.FIRST` |
| `reward` | `None` | No reward on reset. |
| `discount` | `None` | No discount on reset. |
| `observation` | `OrderedDict[str, ndarray]` | Initial observation from the task. |
Outputs -- step(action)
| Field | Type | Value |
|---|---|---|
| `step_type` | `dm_env.StepType` | `StepType.MID` (continuing) or `StepType.LAST` (terminal). |
| `reward` | float | Scalar reward from `task.get_reward()`. |
| `discount` | float | 1.0 (continuing or time-limit), or task-defined (typically 0.0 for failure). |
| `observation` | `OrderedDict[str, ndarray]` | Current observation from the task. |
Usage Examples
Basic episode loop:
```python
from dm_control import suite
import numpy as np

env = suite.load('cartpole', 'swingup')
action_spec = env.action_spec()

time_step = env.reset()
episode_return = 0.0
while not time_step.last():
    action = np.random.uniform(
        action_spec.minimum, action_spec.maximum, size=action_spec.shape
    )
    time_step = env.step(action)
    episode_return += time_step.reward
print(f"Episode return: {episode_return}")
```
Auto-reset behaviour (calling step after terminal):
```python
from dm_control import suite
import numpy as np

env = suite.load('hopper', 'hop')
action_spec = env.action_spec()

time_step = env.reset()
for _ in range(10000):
    action = np.random.uniform(
        action_spec.minimum, action_spec.maximum, size=action_spec.shape
    )
    time_step = env.step(action)
    # No need to check time_step.last() and manually reset;
    # the next call to step() will auto-reset if the episode ended.
```
Distinguishing truncation from termination:
```python
from dm_control import suite
import numpy as np

env = suite.load('walker', 'walk')
action_spec = env.action_spec()

time_step = env.reset()
while not time_step.last():
    action = np.random.uniform(
        action_spec.minimum, action_spec.maximum, size=action_spec.shape
    )
    time_step = env.step(action)

if time_step.discount == 1.0:
    print("Episode ended by time limit (truncation)")
elif time_step.discount == 0.0:
    print("Episode ended by task termination (e.g. fall)")
```