
Implementation: Google DeepMind dm_control Composer Environment

From Leeroopedia
Attribute Value
Implementation Composer Environment
Workflow Composer_Environment_Building
Domain Reinforcement_Learning, Composition
Source dm_control
Last Updated 2026-02-15 00:00 GMT

Overview

The Composer Environment is a concrete tool for assembling a complete reinforcement learning environment from a Composer Task and its entity hierarchy. It exposes the standard dm_env.Environment interface and supports MJCF recompilation, multi-rate observations, robust resetting, and physics error handling.

Description

The Environment class in dm_control.composer.environment is the top-level entry point for running Composer-based RL experiments. It inherits from both _CommonEnvironment (which handles physics compilation, observation updater creation, and the hooks system) and dm_env.Environment (which defines the standard RL interface).

Constructor parameters:

  • task -- a Task instance that defines the root entity, reward, and termination logic.
  • time_limit -- maximum episode duration in seconds (default: infinity).
  • random_state -- an integer seed or np.random.RandomState for reproducibility.
  • max_reset_attempts -- how many times to retry reset if EpisodeInitializationError is raised (default: 1, i.e., no retry).
  • recompile_mjcf_every_episode -- if True (default), calls initialize_episode_mjcf and recompiles the physics at the start of every episode. Set to False for a speedup when the model does not change between episodes.
  • raise_exception_on_physics_error -- if False, physics divergence terminates the episode with a warning instead of raising.
  • strip_singleton_obs_buffer_dim -- if True, observations with buffer_size=1 have the leading buffer dimension squeezed.
  • fixed_initial_state -- if True, every episode starts from the same random state, making trajectories deterministic given the same actions.
  • delayed_observation_padding -- ObservationPadding.ZERO or ObservationPadding.INITIAL_VALUE, controlling how delayed observation buffers are initialized.
  • legacy_step -- if True (default), steps the physics state with up-to-date position and velocity dependent fields.
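
The max_reset_attempts retry behaviour can be illustrated with a simplified, dm_control-free sketch. Everything below is a hypothetical stand-in written for this page (EpisodeInitializationError mimics composer's exception of the same name; reset_with_retries is not the library's code), but it mirrors the documented semantics: max_reset_attempts=1 means a single attempt with no retry, and the last failure is re-raised.

```python
class EpisodeInitializationError(Exception):
    """Stand-in for composer's episode-initialization error."""


def reset_with_retries(initialize_episode, max_reset_attempts=1):
    # Attempt episode initialization up to max_reset_attempts times;
    # re-raise the last error if every attempt fails.
    last_error = None
    for _ in range(max_reset_attempts):
        try:
            return initialize_episode()
        except EpisodeInitializationError as error:
            last_error = error
    raise last_error


# A flaky initializer that fails twice before succeeding.
calls = {"n": 0}

def flaky_init():
    calls["n"] += 1
    if calls["n"] < 3:
        raise EpisodeInitializationError("bad initial pose")
    return "first_timestep"

print(reset_with_retries(flaky_init, max_reset_attempts=5))  # first_timestep
```

With max_reset_attempts=1 (the default), the first EpisodeInitializationError would have propagated immediately.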

Key methods:

  • reset() -- initializes a new episode. Calls initialize_episode_mjcf, recompiles physics (if configured), calls initialize_episode, resets the observation updater, and returns the first TimeStep.
  • step(action) -- advances the environment by one control step. Calls the before/after hooks, steps the physics n_sub_steps times, updates observations, computes reward and discount, checks termination, and returns a TimeStep.
  • observation_spec() -- returns the observation specification from the updater.
  • action_spec() -- delegates to task.action_spec(physics).
  • reward_spec() / discount_spec() -- return custom specs from the task or the dm_env defaults.
  • close() -- frees the underlying physics resources.
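
The call ordering described for step() can be sketched as a minimal, dm_control-free skeleton. The Stub* classes and composer_step below are hypothetical stand-ins, not library code; they only record the documented sequence: before/after hooks wrapped around n_sub_steps physics sub-steps, followed by reward, discount, and termination queries on the task.

```python
calls = []

class StubPhysics:
    def step(self):
        calls.append("physics.step")

class StubHooks:
    def before_step(self, physics, action): calls.append("before_step")
    def before_substep(self, physics, action): calls.append("before_substep")
    def after_substep(self, physics): calls.append("after_substep")
    def after_step(self, physics): calls.append("after_step")

class StubTask:
    def get_reward(self, physics): return 1.0
    def get_discount(self, physics): return 1.0
    def should_terminate_episode(self, physics): return False

def composer_step(physics, task, hooks, action, n_sub_steps):
    # Simplified ordering of Environment.step: hooks around the
    # sub-step loop, then reward/discount/termination from the task.
    hooks.before_step(physics, action)
    for _ in range(n_sub_steps):
        hooks.before_substep(physics, action)
        physics.step()
        hooks.after_substep(physics)
    hooks.after_step(physics)
    return (task.get_reward(physics),
            task.get_discount(physics),
            task.should_terminate_episode(physics))

reward, discount, done = composer_step(
    StubPhysics(), StubTask(), StubHooks(), action=None, n_sub_steps=2)
# calls now records before_step, then 2x (before_substep, physics.step,
# after_substep), then after_step.
```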

Hooks system:

The internal _EnvironmentHooks object scans all entities in the task's entity tree and memoizes non-trivial callback methods. During stepping, only non-empty callbacks are invoked, avoiding the overhead of calling no-op methods on many entities. Extra hooks can be added via add_extra_hook(hook_name, hook_callable).
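
The memoization idea can be illustrated with a simplified sketch. The classes below (Entity, Walker, Prop) and memoize_hooks are hypothetical stand-ins, not dm_control's _EnvironmentHooks implementation; the point is only the pattern of keeping callbacks that actually override a no-op base method and skipping the rest.

```python
class Entity:
    """Hypothetical base entity whose hooks default to no-ops."""
    def before_step(self, physics, random_state):
        pass

class Walker(Entity):
    def before_step(self, physics, random_state):
        physics.append("walker_before_step")  # physics is a log list here

class Prop(Entity):
    pass  # inherits the no-op hook; should be skipped


def memoize_hooks(entities, hook_name):
    # Keep only callbacks that override the base-class no-op, so the
    # step loop never pays for calling trivial methods on many entities.
    base = getattr(Entity, hook_name)
    return [getattr(entity, hook_name) for entity in entities
            if getattr(type(entity), hook_name) is not base]


entities = [Walker(), Prop(), Walker()]
hooks = memoize_hooks(entities, "before_step")  # Prop's no-op is dropped
log = []
for hook in hooks:
    hook(log, random_state=None)
```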

Usage

Instantiate composer.Environment with a configured Task and interact with it using the standard dm_env loop. Adjust constructor parameters to control recompilation frequency, robustness, and observation buffering.

Code Reference

Attribute Value
Source Location dm_control/composer/environment.py:L294-517
Signature Environment.__init__(self, task, time_limit=float('inf'), random_state=None, n_sub_steps=None, raise_exception_on_physics_error=True, strip_singleton_obs_buffer_dim=False, max_reset_attempts=1, recompile_mjcf_every_episode=True, fixed_initial_state=False, delayed_observation_padding=ObservationPadding.ZERO, legacy_step=True)
Signature (reset) Environment.reset(self) -> dm_env.TimeStep
Signature (step) Environment.step(self, action) -> dm_env.TimeStep
Signature (observation_spec) Environment.observation_spec(self) -> OrderedDict
Signature (action_spec) Environment.action_spec(self) -> specs.BoundedArray
Import from dm_control import composer or from dm_control.composer import environment

I/O Contract

Inputs

Name Type Description
task Task A fully configured Composer task with root entity, reward, and termination logic
time_limit float Maximum episode duration in seconds
random_state int or np.random.RandomState Seed or RNG for reproducibility
max_reset_attempts int Number of reset retries on EpisodeInitializationError
recompile_mjcf_every_episode bool Whether to recompile physics each episode
action np.ndarray (for step) Agent action matching action_spec

Outputs

Name Type Description
reset() return dm_env.TimeStep TimeStep(FIRST, None, None, observation)
step() return dm_env.TimeStep TimeStep(MID or LAST, reward, discount, observation)
observation_spec() return OrderedDict[str, specs.Array] Maps observation names to array specs
action_spec() return specs.BoundedArray Shape, dtype, and bounds of the action space
reward_spec() return specs.Array Specification of the reward signal
discount_spec() return specs.BoundedArray Specification of the discount factor
physics weakref.ProxyType[mjcf.Physics] Weak reference to the current physics instance
task Task The task driving this environment

Usage Examples

Basic environment creation and interaction

from dm_control import composer
import numpy as np


# Assume ReachTask is defined as in the Task implementation page
task = ReachTask(robot=my_robot, target_entity=my_target)

env = composer.Environment(
    task=task,
    time_limit=10.0,
    random_state=42)

# Standard dm_env interaction loop
timestep = env.reset()
while not timestep.last():
    action = np.random.uniform(
        low=env.action_spec().minimum,
        high=env.action_spec().maximum)
    timestep = env.step(action)
    print(f"Reward: {timestep.reward}")

env.close()

Environment with domain randomization and robust resetting

env = composer.Environment(
    task=randomized_task,
    time_limit=20.0,
    random_state=123,
    max_reset_attempts=5,
    recompile_mjcf_every_episode=True,
    raise_exception_on_physics_error=False)

# The environment will retry up to 5 times if initialization fails,
# and will gracefully handle physics divergence by terminating the episode.
timestep = env.reset()

Faster environment without per-episode recompilation

# When the MJCF model does not change between episodes,
# skip recompilation for a significant speedup.
env = composer.Environment(
    task=static_task,
    time_limit=30.0,
    recompile_mjcf_every_episode=False,
    strip_singleton_obs_buffer_dim=True)

# Observations with buffer_size=1 will not have a leading dimension.
timestep = env.reset()
obs = timestep.observation

Deterministic episodes for debugging

env = composer.Environment(
    task=my_task,
    time_limit=5.0,
    random_state=0,
    fixed_initial_state=True)

# Every call to reset() produces the identical initial state.
# Given the same action sequence, the trajectory is identical.
ts1 = env.reset()
ts2 = env.reset()
# ts1.observation == ts2.observation (element-wise)

Inspecting specs

env = composer.Environment(task=my_task)
env.reset()

print("Action spec:", env.action_spec())
print("Observation spec:")
for name, spec in env.observation_spec().items():
    print(f"  {name}: shape={spec.shape}, dtype={spec.dtype}")
print("Reward spec:", env.reward_spec())
print("Discount spec:", env.discount_spec())
