
Implementation: Google DeepMind dm_control Composer Environment

From Leeroopedia
Attribute Value
Implementation Composer Environment
Workflow Composer_Environment_Building
Domain Reinforcement_Learning, Composition
Source dm_control
Last Updated 2026-02-15 00:00 GMT

Overview

The Composer Environment is a concrete tool for assembling a complete reinforcement learning environment from a Composer Task and its entity hierarchy. It exposes the standard dm_env.Environment interface and supports MJCF recompilation, multi-rate observations, robust resetting, and physics error handling.

Description

The Environment class in dm_control.composer.environment is the top-level entry point for running Composer-based RL experiments. It inherits from both _CommonEnvironment (which handles physics compilation, observation updater creation, and the hooks system) and dm_env.Environment (which defines the standard RL interface).

Constructor parameters:

  • task -- a Task instance that defines the root entity, reward, and termination logic.
  • time_limit -- maximum episode duration in seconds (default: infinity).
  • random_state -- an integer seed or np.random.RandomState for reproducibility.
  • max_reset_attempts -- how many times to retry reset if EpisodeInitializationError is raised (default: 1, i.e., no retry).
  • recompile_mjcf_every_episode -- if True (default), calls initialize_episode_mjcf and recompiles the physics at the start of every episode. Set to False for a speedup when the model does not change between episodes.
  • raise_exception_on_physics_error -- if False, physics divergence terminates the episode with a warning instead of raising.
  • strip_singleton_obs_buffer_dim -- if True, observations with buffer_size=1 have the leading buffer dimension squeezed.
  • fixed_initial_state -- if True, every episode starts from the same random state, making trajectories deterministic given the same actions.
  • delayed_observation_padding -- ObservationPadding.ZERO or ObservationPadding.INITIAL_VALUE, controlling how delayed observation buffers are initialized.
  • legacy_step -- if True (default), steps the physics state with up-to-date position and velocity dependent fields.
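
The max_reset_attempts retry behaviour can be illustrated with a simplified, dm_control-free sketch. Everything below is a hypothetical stand-in written for this page (EpisodeInitializationError mimics composer's exception of the same name; reset_with_retries is not the library's code), but it mirrors the documented semantics: max_reset_attempts=1 means a single attempt with no retry, and the last failure is re-raised.

```python
class EpisodeInitializationError(Exception):
    """Stand-in for composer's episode-initialization error."""


def reset_with_retries(initialize_episode, max_reset_attempts=1):
    # Attempt episode initialization up to max_reset_attempts times;
    # re-raise the last error if every attempt fails.
    last_error = None
    for _ in range(max_reset_attempts):
        try:
            return initialize_episode()
        except EpisodeInitializationError as error:
            last_error = error
    raise last_error


# A flaky initializer that fails twice before succeeding.
calls = {"n": 0}

def flaky_init():
    calls["n"] += 1
    if calls["n"] < 3:
        raise EpisodeInitializationError("bad initial pose")
    return "first_timestep"

print(reset_with_retries(flaky_init, max_reset_attempts=5))  # first_timestep
```

With max_reset_attempts=1 (the default), the first EpisodeInitializationError would have propagated immediately.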

Key methods:

  • reset() -- initializes a new episode. Calls initialize_episode_mjcf, recompiles physics (if configured), calls initialize_episode, resets the observation updater, and returns the first TimeStep.
  • step(action) -- advances the environment by one control step. Calls the before/after hooks, steps the physics n_sub_steps times, updates observations, computes reward and discount, checks termination, and returns a TimeStep.
  • observation_spec() -- returns the observation specification from the updater.
  • action_spec() -- delegates to task.action_spec(physics).
  • reward_spec() / discount_spec() -- return custom specs from the task or the dm_env defaults.
  • close() -- frees the underlying physics resources.
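
The call ordering described for step() can be sketched as a minimal, dm_control-free skeleton. The Stub* classes and composer_step below are hypothetical stand-ins, not library code; they only record the documented sequence: before/after hooks wrapped around n_sub_steps physics sub-steps, followed by reward, discount, and termination queries on the task.

```python
calls = []

class StubPhysics:
    def step(self):
        calls.append("physics.step")

class StubHooks:
    def before_step(self, physics, action): calls.append("before_step")
    def before_substep(self, physics, action): calls.append("before_substep")
    def after_substep(self, physics): calls.append("after_substep")
    def after_step(self, physics): calls.append("after_step")

class StubTask:
    def get_reward(self, physics): return 1.0
    def get_discount(self, physics): return 1.0
    def should_terminate_episode(self, physics): return False

def composer_step(physics, task, hooks, action, n_sub_steps):
    # Simplified ordering of Environment.step: hooks around the
    # sub-step loop, then reward/discount/termination from the task.
    hooks.before_step(physics, action)
    for _ in range(n_sub_steps):
        hooks.before_substep(physics, action)
        physics.step()
        hooks.after_substep(physics)
    hooks.after_step(physics)
    return (task.get_reward(physics),
            task.get_discount(physics),
            task.should_terminate_episode(physics))

reward, discount, done = composer_step(
    StubPhysics(), StubTask(), StubHooks(), action=None, n_sub_steps=2)
# calls now records before_step, then 2x (before_substep, physics.step,
# after_substep), then after_step.
```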

Hooks system:

The internal _EnvironmentHooks object scans all entities in the task's entity tree and memoizes non-trivial callback methods. During stepping, only non-empty callbacks are invoked, avoiding the overhead of calling no-op methods on many entities. Extra hooks can be added via add_extra_hook(hook_name, hook_callable).
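
The memoization idea can be illustrated with a simplified sketch. The classes below (Entity, Walker, Prop) and memoize_hooks are hypothetical stand-ins, not dm_control's _EnvironmentHooks implementation; the point is only the pattern of keeping callbacks that actually override a no-op base method and skipping the rest.

```python
class Entity:
    """Hypothetical base entity whose hooks default to no-ops."""
    def before_step(self, physics, random_state):
        pass

class Walker(Entity):
    def before_step(self, physics, random_state):
        physics.append("walker_before_step")  # physics is a log list here

class Prop(Entity):
    pass  # inherits the no-op hook; should be skipped


def memoize_hooks(entities, hook_name):
    # Keep only callbacks that override the base-class no-op, so the
    # step loop never pays for calling trivial methods on many entities.
    base = getattr(Entity, hook_name)
    return [getattr(entity, hook_name) for entity in entities
            if getattr(type(entity), hook_name) is not base]


entities = [Walker(), Prop(), Walker()]
hooks = memoize_hooks(entities, "before_step")  # Prop's no-op is dropped
log = []
for hook in hooks:
    hook(log, random_state=None)
```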

Usage

Instantiate composer.Environment with a configured Task and interact with it using the standard dm_env loop. Adjust constructor parameters to control recompilation frequency, robustness, and observation buffering.

Code Reference

Attribute Value
Source Location dm_control/composer/environment.py:L294-517
Signature Environment.__init__(self, task, time_limit=float('inf'), random_state=None, n_sub_steps=None, raise_exception_on_physics_error=True, strip_singleton_obs_buffer_dim=False, max_reset_attempts=1, recompile_mjcf_every_episode=True, fixed_initial_state=False, delayed_observation_padding=ObservationPadding.ZERO, legacy_step=True)
Signature (reset) Environment.reset(self) -> dm_env.TimeStep
Signature (step) Environment.step(self, action) -> dm_env.TimeStep
Signature (observation_spec) Environment.observation_spec(self) -> OrderedDict
Signature (action_spec) Environment.action_spec(self) -> specs.BoundedArray
Import from dm_control import composer or from dm_control.composer import environment

I/O Contract

Inputs

Name Type Description
task Task A fully configured Composer task with root entity, reward, and termination logic
time_limit float Maximum episode duration in seconds
random_state int or np.random.RandomState Seed or RNG for reproducibility
max_reset_attempts int Number of reset retries on EpisodeInitializationError
recompile_mjcf_every_episode bool Whether to recompile physics each episode
action np.ndarray (for step) Agent action matching action_spec

Outputs

Name Type Description
reset() return dm_env.TimeStep TimeStep(FIRST, None, None, observation)
step() return dm_env.TimeStep TimeStep(MID or LAST, reward, discount, observation)
observation_spec() return OrderedDict[str, specs.Array] Maps observation names to array specs
action_spec() return specs.BoundedArray Shape, dtype, and bounds of the action space
reward_spec() return specs.Array Specification of the reward signal
discount_spec() return specs.BoundedArray Specification of the discount factor
physics weakref.ProxyType[mjcf.Physics] Weak reference to the current physics instance
task Task The task driving this environment

Usage Examples

Basic environment creation and interaction

from dm_control import composer
import numpy as np


# Assume ReachTask is defined as in the Task implementation page
task = ReachTask(robot=my_robot, target_entity=my_target)

env = composer.Environment(
    task=task,
    time_limit=10.0,
    random_state=42)

# Standard dm_env interaction loop
timestep = env.reset()
while not timestep.last():
    action = np.random.uniform(
        low=env.action_spec().minimum,
        high=env.action_spec().maximum)
    timestep = env.step(action)
    print(f"Reward: {timestep.reward}")

env.close()

Environment with domain randomization and robust resetting

env = composer.Environment(
    task=randomized_task,
    time_limit=20.0,
    random_state=123,
    max_reset_attempts=5,
    recompile_mjcf_every_episode=True,
    raise_exception_on_physics_error=False)

# The environment will retry up to 5 times if initialization fails,
# and will gracefully handle physics divergence by terminating the episode.
timestep = env.reset()

Faster environment without per-episode recompilation

# When the MJCF model does not change between episodes,
# skip recompilation for a significant speedup.
env = composer.Environment(
    task=static_task,
    time_limit=30.0,
    recompile_mjcf_every_episode=False,
    strip_singleton_obs_buffer_dim=True)

# Observations with buffer_size=1 will not have a leading dimension.
timestep = env.reset()
obs = timestep.observation

Deterministic episodes for debugging

env = composer.Environment(
    task=my_task,
    time_limit=5.0,
    random_state=0,
    fixed_initial_state=True)

# Every call to reset() produces the identical initial state.
# Given the same action sequence, the trajectory is identical.
ts1 = env.reset()
ts2 = env.reset()
# ts1.observation == ts2.observation (element-wise)

Inspecting specs

env = composer.Environment(task=my_task)
env.reset()

print("Action spec:", env.action_spec())
print("Observation spec:")
for name, spec in env.observation_spec().items():
    print(f"  {name}: shape={spec.shape}, dtype={spec.dtype}")
print("Reward spec:", env.reward_spec())
print("Discount spec:", env.discount_spec())
