
Implementation: Google DeepMind dm_control Composer Environment for Manipulation

From Leeroopedia
Metadata
Knowledge Sources dm_control
Domains Reinforcement Learning, Robotics Simulation, Episode Management
Last Updated 2026-02-15 00:00 GMT

Overview

Concrete tool for running manipulation task episodes through the composer.Environment class, which wraps a composer task in the dm_env interface with reset/step cycling, sub-step physics integration, hook dispatching, and time-limited episode management.

Description

The composer.Environment class (in dm_control/composer/environment.py) is the runtime wrapper that the manipulation.load() function instantiates around every manipulation task. It inherits from dm_env.Environment and provides:

reset():

  • Attempts to initialise an episode, retrying up to max_reset_attempts times if an EpisodeInitializationError occurs.
  • Optionally recompiles the MJCF model between episodes (controlled by recompile_mjcf_every_episode).
  • Calls initialize_episode_mjcf() and initialize_episode() hooks on the task and all entities.
  • Resets the observation updater and returns a dm_env.TimeStep with step_type=FIRST, reward=None, discount=None.
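The retry behaviour above can be sketched in isolation. This is an illustrative toy, not the dm_control source: `reset_with_retries` and `flaky_init` are hypothetical names standing in for the logic `reset()` applies around episode initialisation.

```python
# Toy sketch of reset()'s retry loop around episode initialisation.
class EpisodeInitializationError(Exception):
    """Raised when an episode fails to initialise (mirrors composer's error)."""


def reset_with_retries(initialize_episode, max_reset_attempts=1):
    """Call initialize_episode, retrying on EpisodeInitializationError.

    Re-raises the error once max_reset_attempts attempts are exhausted.
    """
    for attempt in range(max_reset_attempts):
        try:
            return initialize_episode()
        except EpisodeInitializationError:
            if attempt == max_reset_attempts - 1:
                raise


# Example: an initialiser that fails once, then succeeds on the retry.
attempts = {'n': 0}

def flaky_init():
    attempts['n'] += 1
    if attempts['n'] < 2:
        raise EpisodeInitializationError()
    return 'FIRST'

print(reset_with_retries(flaky_init, max_reset_attempts=3))  # FIRST
```

With the default `max_reset_attempts=1`, a single initialisation failure propagates to the caller.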

step(action):

  • If a reset is pending (after a terminal step), automatically calls reset().
  • Calls before_step hooks, then loops through n_sub_steps physics integrations, calling before_substep/after_substep hooks around each.
  • If the physics diverges, catches the PhysicsError (unless raise_exception_on_physics_error=True) and terminates the episode with reward 0.
  • After all sub-steps, calls after_step hooks and updates observations.
  • Queries task.get_reward(physics) and task.get_discount(physics).
  • Checks task.should_terminate_episode(physics) and the time limit.
  • Returns a dm_env.TimeStep with step_type=MID (continuing) or step_type=LAST (terminal).
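The step() control flow above can be condensed into a short sketch. Everything here is a toy stand-in, assuming the structure described in the bullets; it is not the dm_control implementation, and `ToyPhysics`/`ToyTask`/`env_step` are hypothetical names.

```python
# Illustrative sketch of step(): sub-step loop with hook dispatch, then
# reward/discount queries and termination checks.
import dataclasses


@dataclasses.dataclass
class TimeStep:
    step_type: str   # 'MID' while the episode continues, 'LAST' at the end
    reward: float
    discount: float


class ToyPhysics:
    """Counts integration sub-steps; each advances time by 0.01 s."""
    def __init__(self):
        self._time = 0.0
    def step(self):
        self._time += 0.01
    def time(self):
        return self._time


class ToyTask:
    """No-op hooks, constant reward, terminates only via the time limit."""
    def before_step(self, physics, action): pass
    def before_substep(self, physics, action): pass
    def after_substep(self, physics): pass
    def after_step(self, physics): pass
    def get_reward(self, physics): return 0.5
    def get_discount(self, physics): return 1.0
    def should_terminate_episode(self, physics): return False


def env_step(task, physics, action, n_sub_steps, time_limit):
    task.before_step(physics, action)
    for _ in range(n_sub_steps):
        task.before_substep(physics, action)
        physics.step()                     # one physics integration sub-step
        task.after_substep(physics)
    task.after_step(physics)
    reward = task.get_reward(physics)
    discount = task.get_discount(physics)
    terminal = (task.should_terminate_episode(physics)
                or physics.time() >= time_limit)
    return TimeStep('LAST' if terminal else 'MID', reward, discount)


physics, task = ToyPhysics(), ToyTask()
ts = env_step(task, physics, action=None, n_sub_steps=4, time_limit=0.1)
print(ts.step_type)  # MID (only 0.04 s of the 0.1 s limit has elapsed)
```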

For manipulation tasks, the default time limit is 10 seconds of simulation time and the control timestep is 0.04 seconds (25 Hz agent action rate). The action vector controls the 6 arm joint velocities and 3 finger joint velocities (9 dimensions total).
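These two defaults fix the episode length in agent steps, since each call to step() advances simulated time by one control timestep:

```python
# Agent steps per episode under the manipulation defaults.
time_limit = 10.0        # seconds of simulated time
control_timestep = 0.04  # seconds per agent step (25 Hz)
steps_per_episode = int(round(time_limit / control_timestep))
print(steps_per_episode)  # 250
```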

The _EnvironmentHooks helper class memoises non-trivial entity hooks to avoid function-call overhead for entities whose hooks are no-ops.

Usage

The composer.Environment is not instantiated directly by users; instead, manipulation.load() creates it. Users interact with the returned object via the standard dm_env protocol: reset(), step(action), observation_spec(), action_spec(), reward_spec(), discount_spec().

Code Reference

Attribute Value
Source Location dm_control/composer/environment.py, lines 294--459
Signatures Environment(task, time_limit=inf, random_state=None, n_sub_steps=None, raise_exception_on_physics_error=True, strip_singleton_obs_buffer_dim=False, max_reset_attempts=1, recompile_mjcf_every_episode=True, fixed_initial_state=False, ...)
Environment.reset() -> dm_env.TimeStep
Environment.step(action: np.ndarray) -> dm_env.TimeStep
Import from dm_control import composer

I/O Contract

Inputs

Method Parameter Type Description
__init__ task composer.Task The task object containing the arena, robot entities, reward logic, and hooks.
__init__ time_limit float Maximum episode duration in simulation seconds. Default: inf (manipulation sets it to 10.0).
__init__ random_state int or np.random.RandomState or None Seed or RNG for episode randomisation.
__init__ max_reset_attempts int Maximum retries for episode initialisation. Default: 1.
step action np.ndarray Action array matching action_spec().shape. For Jaco manipulation: shape (9,) -- 6 arm joint velocities + 3 finger velocities.
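A common pattern is clipping a raw policy output into the spec bounds before passing it to step(). The sketch below assumes symmetric +/-1.0 bounds and the 9-dimensional Jaco layout described above; in practice the bounds come from `action_spec().minimum` and `.maximum`, not hard-coded constants.

```python
# Hedged sketch: clip an unbounded policy output into assumed action bounds.
import numpy as np

minimum = np.full(9, -1.0)   # assumed lower bounds (real values: action_spec().minimum)
maximum = np.full(9, 1.0)    # assumed upper bounds (real values: action_spec().maximum)

rng = np.random.default_rng(0)
raw = rng.normal(size=9) * 2.0           # unbounded policy output
action = np.clip(raw, minimum, maximum)  # now valid for a BoundedArray spec
print(action.shape)  # (9,)
```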

Outputs

Method Return Type Description
reset() dm_env.TimeStep step_type=FIRST, reward=None, discount=None, observation=dict.
step(action) dm_env.TimeStep step_type=MID or LAST, reward=float, discount=float, observation=dict.
action_spec() dm_env.specs.BoundedArray Describes the shape and bounds of the action array.
observation_spec() dict[str, dm_env.specs.Array] Maps observation names to their array specifications.
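Because observation_spec() maps names to array specs with `.shape` and `.dtype`, it is straightforward to pre-allocate matching buffers. The spec dict below is a toy stand-in with hypothetical key names; real specs are `dm_env.specs.Array` objects returned by the environment.

```python
# Sketch: allocate zeroed buffers matching an observation spec.
import numpy as np


class ToySpec:
    """Stand-in for dm_env.specs.Array: just shape and dtype."""
    def __init__(self, shape, dtype):
        self.shape, self.dtype = shape, dtype


obs_spec = {
    'joints_pos': ToySpec((9,), np.float64),       # illustrative key names
    'target_position': ToySpec((3,), np.float64),
}
buffers = {name: np.zeros(s.shape, s.dtype) for name, s in obs_spec.items()}
print(sorted(buffers))  # ['joints_pos', 'target_position']
```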

Usage Examples

from dm_control import manipulation
import numpy as np

# Load a manipulation environment.
env = manipulation.load('reach_site_features', seed=42)

# Inspect specs.
action_spec = env.action_spec()
print('Action shape:', action_spec.shape)       # (9,)
print('Action range:', action_spec.minimum[0], 'to', action_spec.maximum[0])

obs_spec = env.observation_spec()
print('Observation keys:', list(obs_spec.keys()))

# Run one full episode.
timestep = env.reset()
total_reward = 0.0
step_count = 0

while not timestep.last():
    # Random policy.
    action = np.random.uniform(
        action_spec.minimum, action_spec.maximum, size=action_spec.shape)
    timestep = env.step(action)
    total_reward += timestep.reward
    step_count += 1

print(f'Episode finished after {step_count} steps, total reward: {total_reward:.2f}')

from dm_control import manipulation
import numpy as np

# Run multiple episodes for evaluation.
env = manipulation.load('lift_brick_features', seed=0)
action_spec = env.action_spec()

num_episodes = 5
for ep in range(num_episodes):
    timestep = env.reset()
    episode_return = 0.0
    while not timestep.last():
        action = np.zeros(action_spec.shape)  # zero-action baseline
        timestep = env.step(action)
        episode_return += timestep.reward
    print(f'Episode {ep}: return = {episode_return:.3f}')
