Implementation: Google DeepMind dm_control Composer Environment for Manipulation
| Metadata | |
|---|---|
| Knowledge Sources | dm_control |
| Domains | Reinforcement Learning, Robotics Simulation, Episode Management |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool for running manipulation task episodes through the composer.Environment class, which wraps a composer task in the dm_env interface with reset/step cycling, sub-step physics integration, hook dispatching, and time-limited episode management.
Description
The composer.Environment class (in dm_control/composer/environment.py) is the runtime wrapper that the manipulation.load() function instantiates around every manipulation task. It inherits from dm_env.Environment and provides:
reset():
- Attempts to initialise an episode, retrying up to `max_reset_attempts` times if an `EpisodeInitializationError` occurs.
- Optionally recompiles the MJCF model between episodes (controlled by `recompile_mjcf_every_episode`).
- Calls the `initialize_episode_mjcf()` and `initialize_episode()` hooks on the task and all entities.
- Resets the observation updater and returns a `dm_env.TimeStep` with `step_type=FIRST`, `reward=None`, `discount=None`.
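The retry behaviour described above can be sketched as follows. This is a simplified illustration, not the actual dm_control source: `reset_with_retries` and `flaky_init` are hypothetical names introduced here, and the real method also handles MJCF recompilation and observation-updater resets.

```python
# Simplified sketch of reset-with-retries. Names are illustrative,
# not the actual dm_control implementation.
class EpisodeInitializationError(RuntimeError):
    """Raised when an episode cannot be initialised."""


def reset_with_retries(initialize_episode, max_reset_attempts=1):
    """Call initialize_episode, retrying on EpisodeInitializationError."""
    for attempt in range(max_reset_attempts):
        try:
            return initialize_episode()
        except EpisodeInitializationError:
            if attempt == max_reset_attempts - 1:
                raise  # out of attempts: propagate the error


# Usage: succeeds on the second attempt when max_reset_attempts >= 2.
calls = []

def flaky_init():
    calls.append(None)
    if len(calls) < 2:
        raise EpisodeInitializationError('bad initial state')
    return 'first_timestep'

print(reset_with_retries(flaky_init, max_reset_attempts=3))  # first_timestep
```

With the default `max_reset_attempts=1`, a single `EpisodeInitializationError` propagates to the caller.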
step(action):
- If a reset is pending (after a terminal step), automatically calls `reset()`.
- Calls `before_step` hooks, then loops through `n_sub_steps` physics integrations, calling `before_substep`/`after_substep` hooks around each.
- If the physics diverges, catches the `PhysicsError` (unless `raise_exception_on_physics_error=True`) and terminates the episode with reward 0.
- After all sub-steps, calls `after_step` hooks and updates observations.
- Queries `task.get_reward(physics)` and `task.get_discount(physics)`.
- Checks `task.should_terminate_episode(physics)` and the time limit.
- Returns a `dm_env.TimeStep` with `step_type=MID` (continuing) or `step_type=LAST` (terminal).
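The control flow above can be condensed into a sketch. This is a simplified illustration under stated assumptions, not the real implementation: `step_sketch` is a hypothetical function, and it omits pending-reset handling, `PhysicsError` catching, hook memoisation, and observation buffering.

```python
# Simplified sketch of the step() control flow. Names mirror the hooks
# described above, but the structure is illustrative only.
def step_sketch(task, physics, action, n_sub_steps, elapsed, time_limit):
    task.before_step(physics, action)
    for _ in range(n_sub_steps):
        task.before_substep(physics, action)
        physics.step()                      # one physics integration
        task.after_substep(physics)
    task.after_step(physics)
    reward = task.get_reward(physics)
    discount = task.get_discount(physics)
    terminal = task.should_terminate_episode(physics) or elapsed >= time_limit
    step_type = 'LAST' if terminal else 'MID'
    return step_type, reward, discount
```

The key point is the two-level loop: one agent action drives `n_sub_steps` physics integrations, with the per-substep hooks wrapped around each integration and the per-step hooks wrapped around the whole batch.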
For manipulation tasks, the default time limit is 10 seconds of simulation time and the control timestep is 0.04 seconds (25 Hz agent action rate). The action vector controls the 6 arm joint velocities and 3 finger joint velocities (9 dimensions total).
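These defaults fix the episode-length arithmetic: a 10-second time limit at a 0.04-second control timestep gives at most 250 agent steps per episode. The sub-step count depends on the physics timestep, which is not stated above; the 0.002 s value below is an assumption for illustration, so check the task's MJCF model before relying on it.

```python
# Episode-length arithmetic for the default manipulation settings.
time_limit = 10.0        # seconds of simulation time
control_timestep = 0.04  # seconds per agent action (25 Hz)

max_agent_steps = round(time_limit / control_timestep)
print(max_agent_steps)  # 250

# If the physics timestep were 0.002 s (an assumption; check the task's
# MJCF model), each agent step would span this many physics sub-steps:
physics_timestep = 0.002
n_sub_steps = round(control_timestep / physics_timestep)
print(n_sub_steps)  # 20
```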
The _EnvironmentHooks helper class memoises non-trivial entity hooks to avoid function-call overhead for entities whose hooks are no-ops.
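The memoisation idea can be sketched as a filter that keeps only entities whose hook overrides the no-op base implementation, so dispatch never touches the rest. This is an illustrative reconstruction, not the actual `_EnvironmentHooks` code; `memoise_hooks`, `Passive`, and `Active` are hypothetical names.

```python
# Simplified sketch of hook memoisation: keep only entities whose hook
# differs from the no-op default, so per-step dispatch skips the rest.
class Entity:
    def before_step(self, physics):
        pass  # no-op default


def memoise_hooks(entities, hook_name='before_step'):
    base = getattr(Entity, hook_name)
    return [e for e in entities
            if getattr(type(e), hook_name, base) is not base]


class Passive(Entity):
    pass  # inherits the no-op hook


class Active(Entity):
    def before_step(self, physics):
        physics.append(self)  # records that the hook ran


entities = [Passive(), Active(), Passive()]
active = memoise_hooks(entities)
print(len(active))  # 1
```

Computing this list once per episode turns an O(entities) dispatch per sub-step into a loop over only the entities that actually do work.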
Usage
The composer.Environment is not instantiated directly by users; instead, manipulation.load() creates it. Users interact with the returned object via the standard dm_env protocol: reset(), step(action), observation_spec(), action_spec(), reward_spec(), discount_spec().
Code Reference
| Attribute | Value |
|---|---|
| Source Location | `dm_control/composer/environment.py`, lines 294--459 |
| Signatures | `Environment(task, time_limit=inf, random_state=None, n_sub_steps=None, raise_exception_on_physics_error=True, strip_singleton_obs_buffer_dim=False, max_reset_attempts=1, recompile_mjcf_every_episode=True, fixed_initial_state=False, ...)`<br>`Environment.reset() -> dm_env.TimeStep`<br>`Environment.step(action: np.ndarray) -> dm_env.TimeStep` |
| Import | `from dm_control import composer` |
I/O Contract
Inputs
| Method | Parameter | Type | Description |
|---|---|---|---|
| `__init__` | `task` | `composer.Task` | The task object containing the arena, robot entities, reward logic, and hooks. |
| `__init__` | `time_limit` | `float` | Maximum episode duration in simulation seconds. Default: `inf` (manipulation sets it to 10.0). |
| `__init__` | `random_state` | `int`, `np.random.RandomState`, or `None` | Seed or RNG for episode randomisation. |
| `__init__` | `max_reset_attempts` | `int` | Maximum retries for episode initialisation. Default: 1. |
| `step` | `action` | `np.ndarray` | Action array matching `action_spec().shape`. For Jaco manipulation: shape `(9,)` -- 6 arm joint velocities + 3 finger velocities. |
Outputs
| Method | Return Type | Description |
|---|---|---|
| `reset()` | `dm_env.TimeStep` | `step_type=FIRST`, `reward=None`, `discount=None`, `observation=dict`. |
| `step(action)` | `dm_env.TimeStep` | `step_type=MID` or `LAST`, `reward=float`, `discount=float`, `observation=dict`. |
| `action_spec()` | `dm_env.specs.BoundedArray` | Describes the shape and bounds of the action array. |
| `observation_spec()` | `dict[str, dm_env.specs.Array]` | Maps observation names to their array specifications. |
Usage Examples
```python
from dm_control import manipulation
import numpy as np

# Load a manipulation environment.
env = manipulation.load('reach_site_features', seed=42)

# Inspect specs.
action_spec = env.action_spec()
print('Action shape:', action_spec.shape)  # (9,)
print('Action range:', action_spec.minimum[0], 'to', action_spec.maximum[0])

obs_spec = env.observation_spec()
print('Observation keys:', list(obs_spec.keys()))

# Run one full episode.
timestep = env.reset()
total_reward = 0.0
step_count = 0
while not timestep.last():
    # Random policy.
    action = np.random.uniform(
        action_spec.minimum, action_spec.maximum, size=action_spec.shape)
    timestep = env.step(action)
    total_reward += timestep.reward
    step_count += 1
print(f'Episode finished after {step_count} steps, total reward: {total_reward:.2f}')
```
```python
from dm_control import manipulation
import numpy as np

# Run multiple episodes for evaluation.
env = manipulation.load('lift_brick_features', seed=0)
action_spec = env.action_spec()

num_episodes = 5
for ep in range(num_episodes):
    timestep = env.reset()
    episode_return = 0.0
    while not timestep.last():
        action = np.zeros(action_spec.shape)  # zero-action baseline
        timestep = env.step(action)
        episode_return += timestep.reward
    print(f'Episode {ep}: return = {episode_return:.3f}')
```