Implementation: Google DeepMind dm_control Composer Environment for Manipulation
| Metadata | |
|---|---|
| Knowledge Sources | dm_control |
| Domains | Reinforcement Learning, Robotics Simulation, Episode Management |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool for running manipulation task episodes through the composer.Environment class, which wraps a composer task in the dm_env interface with reset/step cycling, sub-step physics integration, hook dispatching, and time-limited episode management.
Description
The composer.Environment class (in dm_control/composer/environment.py) is the runtime wrapper that the manipulation.load() function instantiates around every manipulation task. It inherits from dm_env.Environment and provides:
reset():
- Attempts to initialise an episode, retrying up to `max_reset_attempts` times if an `EpisodeInitializationError` occurs.
- Optionally recompiles the MJCF model between episodes (controlled by `recompile_mjcf_every_episode`).
- Calls the `initialize_episode_mjcf()` and `initialize_episode()` hooks on the task and all entities.
- Resets the observation updater and returns a `dm_env.TimeStep` with `step_type=FIRST`, `reward=None`, `discount=None`.
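The retry behaviour described above can be sketched as follows. This is a simplified illustration, not the actual dm_control source: `reset_with_retries` and `flaky_init` are hypothetical names introduced here, and the real method also handles MJCF recompilation and observation-updater resets.

```python
# Simplified sketch of reset-with-retries. Names are illustrative,
# not the actual dm_control implementation.
class EpisodeInitializationError(RuntimeError):
    """Raised when an episode cannot be initialised."""


def reset_with_retries(initialize_episode, max_reset_attempts=1):
    """Call initialize_episode, retrying on EpisodeInitializationError."""
    for attempt in range(max_reset_attempts):
        try:
            return initialize_episode()
        except EpisodeInitializationError:
            if attempt == max_reset_attempts - 1:
                raise  # out of attempts: propagate the error


# Usage: succeeds on the second attempt when max_reset_attempts >= 2.
calls = []

def flaky_init():
    calls.append(None)
    if len(calls) < 2:
        raise EpisodeInitializationError('bad initial state')
    return 'first_timestep'

print(reset_with_retries(flaky_init, max_reset_attempts=3))  # first_timestep
```

With the default `max_reset_attempts=1`, a single `EpisodeInitializationError` propagates to the caller.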
step(action):
- If a reset is pending (after a terminal step), automatically calls `reset()`.
- Calls `before_step` hooks, then loops through `n_sub_steps` physics integrations, calling `before_substep`/`after_substep` hooks around each.
- If the physics diverges, catches the `PhysicsError` (unless `raise_exception_on_physics_error=True`) and terminates the episode with reward 0.
- After all sub-steps, calls `after_step` hooks and updates observations.
- Queries `task.get_reward(physics)` and `task.get_discount(physics)`.
- Checks `task.should_terminate_episode(physics)` and the time limit.
- Returns a `dm_env.TimeStep` with `step_type=MID` (continuing) or `step_type=LAST` (terminal).
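The control flow above can be condensed into a sketch. This is a simplified illustration under stated assumptions, not the real implementation: `step_sketch` is a hypothetical function, and it omits pending-reset handling, `PhysicsError` catching, hook memoisation, and observation buffering.

```python
# Simplified sketch of the step() control flow. Names mirror the hooks
# described above, but the structure is illustrative only.
def step_sketch(task, physics, action, n_sub_steps, elapsed, time_limit):
    task.before_step(physics, action)
    for _ in range(n_sub_steps):
        task.before_substep(physics, action)
        physics.step()                      # one physics integration
        task.after_substep(physics)
    task.after_step(physics)
    reward = task.get_reward(physics)
    discount = task.get_discount(physics)
    terminal = task.should_terminate_episode(physics) or elapsed >= time_limit
    step_type = 'LAST' if terminal else 'MID'
    return step_type, reward, discount
```

The key point is the two-level loop: one agent action drives `n_sub_steps` physics integrations, with the per-substep hooks wrapped around each integration and the per-step hooks wrapped around the whole batch.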
For manipulation tasks, the default time limit is 10 seconds of simulation time and the control timestep is 0.04 seconds (25 Hz agent action rate). The action vector controls the 6 arm joint velocities and 3 finger joint velocities (9 dimensions total).
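These defaults fix the episode-length arithmetic: a 10-second time limit at a 0.04-second control timestep gives at most 250 agent steps per episode. The sub-step count depends on the physics timestep, which is not stated above; the 0.002 s value below is an assumption for illustration, so check the task's MJCF model before relying on it.

```python
# Episode-length arithmetic for the default manipulation settings.
time_limit = 10.0        # seconds of simulation time
control_timestep = 0.04  # seconds per agent action (25 Hz)

max_agent_steps = round(time_limit / control_timestep)
print(max_agent_steps)  # 250

# If the physics timestep were 0.002 s (an assumption; check the task's
# MJCF model), each agent step would span this many physics sub-steps:
physics_timestep = 0.002
n_sub_steps = round(control_timestep / physics_timestep)
print(n_sub_steps)  # 20
```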
The _EnvironmentHooks helper class memoises non-trivial entity hooks to avoid function-call overhead for entities whose hooks are no-ops.
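The memoisation idea can be sketched as a filter that keeps only entities whose hook overrides the no-op base implementation, so dispatch never touches the rest. This is an illustrative reconstruction, not the actual `_EnvironmentHooks` code; `memoise_hooks`, `Passive`, and `Active` are hypothetical names.

```python
# Simplified sketch of hook memoisation: keep only entities whose hook
# differs from the no-op default, so per-step dispatch skips the rest.
class Entity:
    def before_step(self, physics):
        pass  # no-op default


def memoise_hooks(entities, hook_name='before_step'):
    base = getattr(Entity, hook_name)
    return [e for e in entities
            if getattr(type(e), hook_name, base) is not base]


class Passive(Entity):
    pass  # inherits the no-op hook


class Active(Entity):
    def before_step(self, physics):
        physics.append(self)  # records that the hook ran


entities = [Passive(), Active(), Passive()]
active = memoise_hooks(entities)
print(len(active))  # 1
```

Computing this list once per episode turns an O(entities) dispatch per sub-step into a loop over only the entities that actually do work.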
Usage
The composer.Environment is not instantiated directly by users; instead, manipulation.load() creates it. Users interact with the returned object via the standard dm_env protocol: reset(), step(action), observation_spec(), action_spec(), reward_spec(), discount_spec().
Code Reference
| Attribute | Value |
|---|---|
| Source Location | `dm_control/composer/environment.py`, lines 294--459 |
| Signatures | `Environment(task, time_limit=inf, random_state=None, n_sub_steps=None, raise_exception_on_physics_error=True, strip_singleton_obs_buffer_dim=False, max_reset_attempts=1, recompile_mjcf_every_episode=True, fixed_initial_state=False, ...)`<br>`Environment.reset() -> dm_env.TimeStep`<br>`Environment.step(action: np.ndarray) -> dm_env.TimeStep` |
| Import | `from dm_control import composer` |
I/O Contract
Inputs
| Method | Parameter | Type | Description |
|---|---|---|---|
| `__init__` | `task` | `composer.Task` | The task object containing the arena, robot entities, reward logic, and hooks. |
| `__init__` | `time_limit` | `float` | Maximum episode duration in simulation seconds. Default: `inf` (manipulation sets it to 10.0). |
| `__init__` | `random_state` | `int`, `np.random.RandomState`, or `None` | Seed or RNG for episode randomisation. |
| `__init__` | `max_reset_attempts` | `int` | Maximum retries for episode initialisation. Default: 1. |
| `step` | `action` | `np.ndarray` | Action array matching `action_spec().shape`. For Jaco manipulation: shape `(9,)` -- 6 arm joint velocities + 3 finger velocities. |
Outputs
| Method | Return Type | Description |
|---|---|---|
| `reset()` | `dm_env.TimeStep` | `step_type=FIRST`, `reward=None`, `discount=None`, `observation=dict`. |
| `step(action)` | `dm_env.TimeStep` | `step_type=MID` or `LAST`, `reward=float`, `discount=float`, `observation=dict`. |
| `action_spec()` | `dm_env.specs.BoundedArray` | Describes the shape and bounds of the action array. |
| `observation_spec()` | `dict[str, dm_env.specs.Array]` | Maps observation names to their array specifications. |
Usage Examples
```python
from dm_control import manipulation
import numpy as np

# Load a manipulation environment.
env = manipulation.load('reach_site_features', seed=42)

# Inspect specs.
action_spec = env.action_spec()
print('Action shape:', action_spec.shape)  # (9,)
print('Action range:', action_spec.minimum[0], 'to', action_spec.maximum[0])

obs_spec = env.observation_spec()
print('Observation keys:', list(obs_spec.keys()))

# Run one full episode.
timestep = env.reset()
total_reward = 0.0
step_count = 0
while not timestep.last():
    # Random policy.
    action = np.random.uniform(
        action_spec.minimum, action_spec.maximum, size=action_spec.shape)
    timestep = env.step(action)
    total_reward += timestep.reward
    step_count += 1
print(f'Episode finished after {step_count} steps, total reward: {total_reward:.2f}')
```
```python
from dm_control import manipulation
import numpy as np

# Run multiple episodes for evaluation.
env = manipulation.load('lift_brick_features', seed=0)
action_spec = env.action_spec()

num_episodes = 5
for ep in range(num_episodes):
    timestep = env.reset()
    episode_return = 0.0
    while not timestep.last():
        action = np.zeros(action_spec.shape)  # zero-action baseline
        timestep = env.step(action)
        episode_return += timestep.reward
    print(f'Episode {ep}: return = {episode_return:.3f}')
```