Implementation:Google deepmind Dm control Composer Environment For Soccer

From Leeroopedia
Metadata
Knowledge Sources | dm_control
Domains | Multi-Agent Reinforcement Learning, Environment Interfaces
Last Updated | 2026-02-15 00:00 GMT

Overview

Concrete tool for running the multi-agent soccer simulation loop through the composer.Environment class, which wraps a Composer task in a dm_env.Environment-compatible interface that accepts joint actions and returns per-player observations, rewards, and discounts.

Description

The composer.Environment class (defined in dm_control/composer/environment.py) is the runtime engine for all Composer-based environments, including multi-agent soccer. Its key responsibilities include:

  • reset() -- Optionally recompiles the MJCF model (controlled by recompile_mjcf_every_episode), runs the initialize_episode_mjcf and initialize_episode hooks, resets the physics context, and returns a dm_env.TimeStep with StepType.FIRST.
  • step(action) -- Calls before_step (which dispatches per-player actions to walkers), runs n_sub_steps physics substeps (each bracketed by before_substep/after_substep), calls after_step, queries the task for reward/discount/termination, and returns a TimeStep with StepType.MID or StepType.LAST.
  • Observation management -- Uses an observation.Updater to buffer and deliver per-player observations at the correct frequency.
  • Error handling -- When raise_exception_on_physics_error is False, catches PhysicsError exceptions and terminates the episode gracefully rather than crashing.
  • Time limit enforcement -- Terminates the episode once physics.time() reaches time_limit.
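
The call order described in the list above can be illustrated with a small, self-contained sketch. This is not the dm_control implementation: ToyEnvironment, its log attribute, and the placeholder rewards are hypothetical stand-ins, and observation buffering, physics-error handling, and MJCF recompilation are left out. It only shows where each hook sits relative to the physics substeps and how the step type is chosen.

```python
import enum
from typing import Any, List, NamedTuple


class StepType(enum.IntEnum):
    """Mirrors the three step types named above."""
    FIRST = 0
    MID = 1
    LAST = 2


class TimeStep(NamedTuple):
    step_type: StepType
    reward: Any
    discount: Any
    observation: Any


class ToyEnvironment:
    """Hypothetical stand-in that records hook order; there is no real physics."""

    def __init__(self, num_players: int, time_limit: float,
                 n_sub_steps: int, physics_timestep: float):
        self._num_players = num_players
        self._time_limit = time_limit
        self._n_sub_steps = n_sub_steps
        self._physics_timestep = physics_timestep
        self._substeps_done = 0
        self.log: List[str] = []

    def _observe(self) -> list:
        # One (empty) observation dict per player.
        return [{} for _ in range(self._num_players)]

    def reset(self) -> TimeStep:
        self.log.append("initialize_episode")
        self._substeps_done = 0
        # The first timestep carries no reward or discount.
        return TimeStep(StepType.FIRST, None, None, self._observe())

    def step(self, action: list) -> TimeStep:
        self.log.append("before_step")  # per-player actions would be applied here
        for _ in range(self._n_sub_steps):
            self.log.append("before_substep")
            self._substeps_done += 1  # stands in for one physics step
            self.log.append("after_substep")
        self.log.append("after_step")
        reward = [0.0] * self._num_players  # placeholder per-player rewards
        discount = 1.0
        sim_time = self._substeps_done * self._physics_timestep
        done = sim_time >= self._time_limit  # time limit reached
        return TimeStep(StepType.LAST if done else StepType.MID,
                        reward, discount, self._observe())


env = ToyEnvironment(num_players=4, time_limit=1.0,
                     n_sub_steps=2, physics_timestep=0.25)
print(env.reset().step_type.name)
print(env.step([None] * 4).step_type.name)
print(env.log)
```

Printing the log after one reset and one step makes the bracketing visible: each substep is wrapped by its own before/after pair, and after_step fires once all substeps have finished.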

For soccer, the action argument to step() is a list of numpy arrays (one per player), and the returned timestep.observation is a list of observation dictionaries (one per player). timestep.reward is likewise a list with one scalar array per player, while timestep.discount is a single scalar shared by all players.

Usage

The composer.Environment is instantiated automatically by soccer.load(). Users interact with it through the standard reset() / step() interface.

Code Reference

Attribute | Value
Source Location | dm_control/composer/environment.py, lines 294-465
Signature (reset) | def reset(self) -> dm_env.TimeStep
Signature (step) | def step(self, action) -> dm_env.TimeStep
Constructor | Environment(task, time_limit=float('inf'), random_state=None, n_sub_steps=None, raise_exception_on_physics_error=True, strip_singleton_obs_buffer_dim=False, max_reset_attempts=1, recompile_mjcf_every_episode=True, fixed_initial_state=False, delayed_observation_padding=ObservationPadding.ZERO, legacy_step=True)
Import | from dm_control import composer

I/O Contract

Inputs (step):

Parameter | Type | Description
action | list[np.ndarray] | One action array per player. Each array's shape must match the player's walker.action_spec.
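
The shape requirement above can be checked before calling step(). The helper below is a hypothetical sketch, not a dm_control function. It assumes only that each action and each spec exposes a shape attribute, as numpy arrays and the specs returned by env.action_spec() do, so it can be exercised here with simple stand-in objects and an arbitrary shape of (3,).

```python
from types import SimpleNamespace
from typing import Sequence


def check_joint_action(actions: Sequence, action_specs: Sequence) -> None:
    """Raises ValueError if the joint action does not line up with the specs."""
    if len(actions) != len(action_specs):
        raise ValueError(
            f"expected {len(action_specs)} actions, got {len(actions)}")
    for i, (action, spec) in enumerate(zip(actions, action_specs)):
        if tuple(action.shape) != tuple(spec.shape):
            raise ValueError(
                f"player {i}: action shape {tuple(action.shape)} "
                f"does not match spec shape {tuple(spec.shape)}")


# Stand-in objects; with a real environment, pass numpy arrays and
# the list returned by env.action_spec() instead.
specs = [SimpleNamespace(shape=(3,)), SimpleNamespace(shape=(3,))]
good = [SimpleNamespace(shape=(3,)), SimpleNamespace(shape=(3,))]
check_joint_action(good, specs)  # passes silently
```

Failing early with the player index in the message tends to be easier to debug than a shape error raised from inside the simulation step.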

Outputs (step and reset):

Field | Type | Description
timestep.step_type | dm_env.StepType | FIRST after reset, MID during play, LAST on termination.
timestep.observation | list[OrderedDict] | Per-player observation dictionaries. Each key maps to a numpy array.
timestep.reward | list[np.ndarray] or None | Per-player scalar rewards. None on FIRST step.
timestep.discount | np.ndarray or None | Scalar discount factor. None on FIRST step.
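
Because reward and discount are None on the FIRST timestep, code that accumulates returns has to skip that step. A minimal sketch of such bookkeeping follows. accumulate_returns is a hypothetical helper rather than part of dm_control, and the episode below is hand-written with made-up reward and discount values purely for illustration. It relies only on the reward field described in the table above, so it runs without a simulator.

```python
from types import SimpleNamespace
from typing import Iterable, List


def accumulate_returns(timesteps: Iterable, num_players: int) -> List[float]:
    """Sums undiscounted per-player rewards over one episode.

    Timesteps whose reward is None (the FIRST step) contribute nothing.
    """
    returns = [0.0] * num_players
    for ts in timesteps:
        if ts.reward is None:
            continue
        if len(ts.reward) != num_players:
            raise ValueError(
                f"expected {num_players} rewards, got {len(ts.reward)}")
        for i, r in enumerate(ts.reward):
            returns[i] += float(r)  # float() also accepts 0-d numpy arrays
    return returns


# Hand-written stand-ins for a two-player episode: FIRST, MID, then LAST.
episode = [
    SimpleNamespace(reward=None, discount=None),
    SimpleNamespace(reward=[0.0, 0.0], discount=1.0),
    SimpleNamespace(reward=[1.0, -1.0], discount=0.0),
]
print(accumulate_returns(episode, num_players=2))
```

With a real environment, the same function could be fed the timesteps collected from reset() and each subsequent step() call.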

Inputs (constructor):

Parameter | Type | Description
task | composer.Task | A Composer task instance (e.g. soccer.Task or soccer.MultiturnTask).
time_limit | float | Maximum episode duration in seconds.
random_state | int, np.random.RandomState, or None | Random seed or RNG.
max_reset_attempts | int | Maximum retries on EpisodeInitializationError. Default 1.
recompile_mjcf_every_episode | bool | If True, recompile the MJCF model on every reset. Default True.

Usage Examples

import numpy as np
from dm_control.locomotion import soccer

# Create a 2v2 environment.
env = soccer.load(team_size=2, time_limit=45.0)

# Reset returns the first timestep, which carries no reward.
timestep = env.reset()
assert timestep.first()
assert timestep.reward is None
print(len(timestep.observation))  # 4 (2 home + 2 away)

# Step with random actions, one array per player.
action_specs = env.action_spec()
actions = [np.random.uniform(s.minimum, s.maximum, s.shape) for s in action_specs]
timestep = env.step(actions)
assert timestep.mid() or timestep.last()
print(len(timestep.reward))  # 4 per-player rewards

# Run a full episode with zero actions.
timestep = env.reset()
total_steps = 0
while not timestep.last():  # avoids comparing step_type against the raw value 2
    actions = [np.zeros(s.shape) for s in action_specs]
    timestep = env.step(actions)
    total_steps += 1
print(f"Episode lasted {total_steps} steps")
