Implementation:Google deepmind Dm control Composer Environment For Locomotion

From Leeroopedia
Revision as of 12:42, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Google_deepmind_Dm_control_Composer_Environment_For_Locomotion.md)
Metadata
Knowledge Sources: dm_control
Domains: Reinforcement Learning, Robotics, Environment Design
Last Updated: 2026-02-15 00:00 GMT

Overview

Concrete tool for assembling dm_control locomotion components (walker, arena, task) into a complete reinforcement learning environment that implements the dm_env.Environment interface with episode lifecycle management, observation buffering, and error handling.

Description

The composer.Environment class takes a fully configured task object (which already holds references to a walker and an arena) and wraps it in a dm_env-compatible environment. It manages the full episode lifecycle: MJCF model regeneration and recompilation, physics initialization, observation collection and buffering, action application through physics substeps, reward and discount computation, and episode termination. It supports procedural environments whose MJCF model changes between episodes (maze regeneration, corridor resizing) by recompiling the physics model on each reset.
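
To make the substep arithmetic concrete, here is a minimal sketch. It is not dm_control's implementation: ToyPhysics and control_step are hypothetical names, and it assumes the number of substeps is the control timestep divided by the physics timestep, rounded. With a 0.005 s physics timestep and a 0.03 s control timestep, one agent step covers six physics substeps.

```python
import math


class ToyPhysics:
    """Hypothetical stand-in for a physics engine: integrates a 1-D position."""

    def __init__(self, timestep):
        self.timestep = timestep
        self.time = 0.0
        self.position = 0.0

    def step(self, velocity):
        # One explicit Euler substep.
        self.position += velocity * self.timestep
        self.time += self.timestep


def control_step(physics, action, control_timestep, time_limit=math.inf):
    """Advance one agent step as several physics substeps; return (reward, done)."""
    n_sub_steps = round(control_timestep / physics.timestep)
    for _ in range(n_sub_steps):
        physics.step(action)
    reward = -abs(1.0 - physics.position)  # toy reward: negative distance to x = 1
    done = physics.time >= time_limit - 1e-9  # small tolerance for float drift
    return reward, done


physics = ToyPhysics(timestep=0.005)
reward, done = control_step(
    physics, action=1.0, control_timestep=0.03, time_limit=0.06)
```

After the call, the toy physics clock has advanced by one control timestep (about 0.03 s), and a second call would reach the 0.06 s time limit and report the episode as finished.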

Key features include:

  • Episode reset with retry: If episode initialization fails (e.g., rejection sampling cannot find valid prop positions), the environment retries up to max_reset_attempts times.
  • Configurable recompilation: The recompile_mjcf_every_episode flag controls whether the MJCF model is regenerated each episode. Disabling it speeds up resets for static environments.
  • Fixed initial state: The fixed_initial_state flag ensures deterministic episode starts by resetting the random state before each initialization.
  • Physics error handling: Physics divergence can either raise an exception or silently terminate the episode with zero reward, depending on raise_exception_on_physics_error.
  • Observation management: Observations are collected from all enabled walker and task observables, with support for delayed observations and configurable buffer padding.
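
The retry behaviour in the first bullet can be sketched as follows. This is a simplified model, not the library's code: EpisodeInitializationError here is a local stand-in class, and reset_with_retry and make_flaky_reset are hypothetical helpers introduced only for illustration.

```python
class EpisodeInitializationError(Exception):
    """Local stand-in for the error raised when an episode cannot be set up."""


def reset_with_retry(attempt_reset, max_reset_attempts=1):
    """Call attempt_reset() until it succeeds or the attempt budget is spent."""
    failed_attempts = 0
    while True:
        try:
            return attempt_reset()
        except EpisodeInitializationError:
            failed_attempts += 1
            if failed_attempts >= max_reset_attempts:
                raise  # budget exhausted: surface the last failure


def make_flaky_reset(failures_before_success):
    """Hypothetical initializer that fails a fixed number of times first."""
    calls = {"count": 0}

    def attempt_reset():
        calls["count"] += 1
        if calls["count"] <= failures_before_success:
            raise EpisodeInitializationError("no valid prop placement found")
        return f"first timestep after {calls['count']} attempt(s)"

    return attempt_reset


# Two failures fit inside a budget of three attempts.
outcome = reset_with_retry(make_flaky_reset(2), max_reset_attempts=3)
```

With a budget of only two attempts, the same flaky initializer would exhaust the budget and the error would propagate to the caller.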

Usage

Use composer.Environment as the final assembly step after creating a walker, arena, and task. This is the object that RL training loops interact with through reset() and step(action).

Code Reference

Source Location

Class File Lines
Environment Template:Code L294-517

Signature

class Environment(_CommonEnvironment, dm_env.Environment):
    def __init__(
        self,
        task,
        time_limit=float('inf'),
        random_state=None,
        n_sub_steps=None,
        raise_exception_on_physics_error=True,
        strip_singleton_obs_buffer_dim=False,
        max_reset_attempts=1,
        recompile_mjcf_every_episode=True,
        fixed_initial_state=False,
        delayed_observation_padding=ObservationPadding.ZERO,
        legacy_step=True,
    ):
        ...

Import

from dm_control import composer

I/O Contract

Inputs

| Parameter | Type | Description |
|---|---|---|
| task | composer.Task | A fully configured task instance containing references to walker and arena. |
| time_limit | float | Maximum episode duration in seconds. Default float('inf'). |
| random_state | int or np.random.RandomState or None | Seed or random state for reproducibility. Default None. |
| max_reset_attempts | int | Maximum number of times to retry episode initialization on failure. Default 1. |
| recompile_mjcf_every_episode | bool | Whether to regenerate and recompile the MJCF model each episode. Default True. |
| fixed_initial_state | bool | If True, reset random state before each episode for determinism. Default False. |
| raise_exception_on_physics_error | bool | If True, raise PhysicsError; if False, terminate episode silently. Default True. |
| strip_singleton_obs_buffer_dim | bool | If True, remove the leading buffer dimension from observations with buffer_size=1. Default False. |
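
The effect of strip_singleton_obs_buffer_dim can be illustrated with plain nested lists. This is a sketch only: the function name, the observable names, and the buffer_sizes mapping are hypothetical, and the real environment operates on NumPy arrays rather than lists.

```python
def strip_singleton_buffer_dim(observations, buffer_sizes):
    """Drop the leading buffer axis for observables whose buffer_size is 1."""
    stripped = {}
    for name, buffered in observations.items():
        if buffer_sizes[name] == 1:
            stripped[name] = buffered[0]  # shape (1, N) becomes (N,)
        else:
            stripped[name] = buffered     # shape (B, N) is left untouched
    return stripped


observations = {
    "joint_angles": [[0.1, 0.2, 0.3]],          # buffer_size 1: shape (1, 3)
    "delayed_touch": [[0.0, 1.0], [1.0, 1.0]],  # buffer_size 2: shape (2, 2)
}
buffer_sizes = {"joint_angles": 1, "delayed_touch": 2}
result = strip_singleton_buffer_dim(observations, buffer_sizes)
```

Only the unbuffered observable loses its leading axis; observables that keep a history of several values retain their buffer dimension.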

Outputs

| Method | Return Type | Description |
|---|---|---|
| reset() | dm_env.TimeStep | First timestep of a new episode (step_type=FIRST, reward=None, discount=None). |
| step(action) | dm_env.TimeStep | Timestep after applying action (step_type=MID or LAST, reward, discount, observation). |
| observation_spec() | OrderedDict | Maps observation names to specs.Array with shape and dtype. |
| action_spec() | specs.BoundedArray | Action bounds and shape from the task. |
| reward_spec() | specs.Array | Reward specification. |
| discount_spec() | specs.Array | Discount specification. |
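
Because the first timestep carries reward=None and discount=None, code that accumulates returns has to skip it. The sketch below shows one way to do that. The TimeStep class is a local mirror of the four dm_env.TimeStep fields, with plain strings standing in for the step-type enum; discounted_return is a hypothetical helper.

```python
from typing import NamedTuple, Optional


class TimeStep(NamedTuple):
    """Local mirror of the four dm_env.TimeStep fields, for illustration only."""
    step_type: str
    reward: Optional[float]
    discount: Optional[float]
    observation: dict


def discounted_return(timesteps):
    """Sum rewards, weighting each by the product of the preceding discounts."""
    total, weight = 0.0, 1.0
    for ts in timesteps:
        if ts.step_type == "FIRST":
            continue  # reward and discount are None on the first timestep
        total += weight * ts.reward
        weight *= ts.discount
    return total


episode = [
    TimeStep("FIRST", None, None, {}),
    TimeStep("MID", 1.0, 0.5, {}),
    TimeStep("MID", 2.0, 0.5, {}),
    TimeStep("LAST", 4.0, 0.0, {}),
]
episode_return = discounted_return(episode)  # 1.0 + 0.5 * 2.0 + 0.25 * 4.0
```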

Usage Examples

Basic locomotion environment with default settings:

import numpy as np
from dm_control import composer
from dm_control.locomotion.walkers import cmu_humanoid
from dm_control.locomotion.arenas import floors
from dm_control.locomotion.tasks import go_to_target

walker = cmu_humanoid.CMUHumanoidPositionControlled()
arena = floors.Floor(size=(8, 8))
task = go_to_target.GoToTarget(
    walker=walker, arena=arena,
    physics_timestep=0.005, control_timestep=0.03)

env = composer.Environment(
    task=task,
    time_limit=30,
    strip_singleton_obs_buffer_dim=True)

# Standard RL interaction loop with uniformly sampled actions
action_spec = env.action_spec()
timestep = env.reset()
while not timestep.last():
    # generate_value() returns a fixed spec-conforming value, not a random
    # sample, so draw uniformly between the action bounds instead
    action = np.random.uniform(
        action_spec.minimum, action_spec.maximum, size=action_spec.shape)
    timestep = env.step(action)
    print(f"Reward: {timestep.reward}")

Environment with procedural maze and retry logic:

from dm_control import composer

# `task` is assumed to be a ManyGoalsMaze built on a RandomMazeWithTargets
# arena (construction not shown)
env = composer.Environment(
    task=task,
    time_limit=30,
    random_state=42,
    max_reset_attempts=5,
    recompile_mjcf_every_episode=True,
    strip_singleton_obs_buffer_dim=True)

# Each reset generates a new maze layout
timestep = env.reset()
print(env.observation_spec().keys())
print(env.action_spec().shape)

Deterministic environment for debugging:

from dm_control import composer

# `task` is any configured composer.Task (construction not shown)
env = composer.Environment(
    task=task,
    time_limit=10,
    random_state=0,
    fixed_initial_state=True,
    recompile_mjcf_every_episode=True)

# Every reset produces the same initial state
ts1 = env.reset()
ts2 = env.reset()
# ts1.observation and ts2.observation will be identical
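
The idea behind fixed_initial_state can be sketched with the standard library alone. This is a toy model under stated assumptions, not dm_control's code: ToyEnvironment is a hypothetical class that snapshots its random generator at construction and restores that snapshot before each reset when the flag is set.

```python
import random


class ToyEnvironment:
    """Hypothetical environment whose initial state is drawn from an RNG."""

    def __init__(self, seed, fixed_initial_state=False):
        self._random_state = random.Random(seed)
        self._fixed_initial_state = fixed_initial_state
        # Snapshot taken once, before any episode has consumed random numbers.
        self._initial_rng_state = self._random_state.getstate()

    def reset(self):
        if self._fixed_initial_state:
            # Rewind the generator so every episode sees the same draws.
            self._random_state.setstate(self._initial_rng_state)
        return [self._random_state.uniform(-1.0, 1.0) for _ in range(3)]


fixed_env = ToyEnvironment(seed=0, fixed_initial_state=True)
first, second = fixed_env.reset(), fixed_env.reset()

free_env = ToyEnvironment(seed=0, fixed_initial_state=False)
first_free, second_free = free_env.reset(), free_env.reset()
```

With the flag set, both resets yield the same initial state; without it, the second reset continues the random stream and produces a different one.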

Environment with lenient physics error handling:

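from dm_control import composer

# `task` is any configured composer.Task (construction not shown)
env = composer.Environment(
    task=task,
    time_limit=60,
    raise_exception_on_physics_error=False,
    max_reset_attempts=3)

# Physics divergence will terminate the episode with reward=0
# rather than raising an exception
action_spec = env.action_spec()
timestep = env.reset()
while not timestep.last():
    action = action_spec.generate_value()  # placeholder in-bounds action
    timestep = env.step(action)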
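
The two error policies can be contrasted in a small self-contained sketch. It is a simplified model rather than the library's step logic: PhysicsError is a local stand-in class, step_with_error_policy and diverge are hypothetical, and the zero discount on divergence is an assumption of this sketch.

```python
class PhysicsError(RuntimeError):
    """Local stand-in for the error raised when the simulation diverges."""


def step_with_error_policy(physics_step, compute_reward,
                           raise_exception_on_physics_error=True):
    """Run one physics step and apply the configured divergence policy."""
    try:
        physics_step()
    except PhysicsError:
        if raise_exception_on_physics_error:
            raise  # strict mode: the caller must handle the failure
        # Lenient mode: end the episode quietly with zero reward.
        return {"reward": 0.0, "discount": 0.0, "last": True}
    return {"reward": compute_reward(), "discount": 1.0, "last": False}


def diverge():
    """Hypothetical physics step that always blows up."""
    raise PhysicsError("simulation state contains NaN")


lenient = step_with_error_policy(
    diverge, lambda: 1.0, raise_exception_on_physics_error=False)
healthy = step_with_error_policy(lambda: None, lambda: 1.0)
```

Lenient handling keeps long training runs alive, at the cost of hiding instabilities that strict mode would surface immediately.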
