Implementation:Google deepmind Dm control Composer Environment For Soccer

Metadata
Knowledge Sources	dm_control
Domains	Multi-Agent Reinforcement Learning, Environment Interfaces
Last Updated	2026-02-15 00:00 GMT

Overview

Concrete tool for running the multi-agent soccer simulation loop through the composer.Environment class, which wraps a Composer task in a dm_env.Environment-compatible interface that accepts joint actions and returns per-player observations, rewards, and discounts.

Description

The composer.Environment class (defined in dm_control/composer/environment.py) is the runtime engine for all Composer-based environments, including multi-agent soccer. Its key responsibilities include:

reset() -- Optionally recompiles the MJCF model (controlled by recompile_mjcf_every_episode), runs the initialize_episode_mjcf and initialize_episode hooks, resets the physics context, and returns a dm_env.TimeStep with StepType.FIRST.
step(action) -- Calls before_step (which dispatches per-player actions to walkers), runs n_sub_steps physics substeps (each bracketed by before_substep/after_substep), calls after_step, queries the task for reward/discount/termination, and returns a TimeStep with StepType.MID or StepType.LAST.
Observation management -- Uses an observation.Updater to buffer and deliver per-player observations at the correct frequency.
Error handling -- Optionally catches PhysicsError exceptions and terminates the episode gracefully rather than crashing.
Time limit enforcement -- Terminates the episode once the simulated time reported by physics.time() reaches time_limit.
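The hook ordering inside step() and the time-limit check can be illustrated with a small toy model. This is a simplified sketch, not the code in environment.py: ToyTask and toy_step are hypothetical names, the hooks take fewer arguments than the real Composer hooks, a plain float stands in for the physics clock, and the exact comparison against the time limit is an assumption.

```python
class ToyTask:
    """Hypothetical stand-in for a Composer task; it only records hook order."""

    def __init__(self):
        self.calls = []

    def before_step(self, action):
        self.calls.append("before_step")

    def before_substep(self, action):
        self.calls.append("before_substep")

    def after_substep(self):
        self.calls.append("after_substep")

    def after_step(self):
        self.calls.append("after_step")


def toy_step(task, action, sim_time, n_sub_steps, substep_dt, time_limit):
    """Advances a toy clock through one control step.

    Returns the new simulated time and the name of the resulting step type.
    """
    task.before_step(action)
    for _ in range(n_sub_steps):
        task.before_substep(action)
        sim_time += substep_dt  # stands in for one physics substep
        task.after_substep()
    task.after_step()
    # Assumption: the episode ends once the clock reaches the limit.
    step_type = "LAST" if sim_time >= time_limit else "MID"
    return sim_time, step_type
```

With n_sub_steps=2, a single call records before_step, two before_substep/after_substep pairs, and then after_step, which mirrors the bracketing described for step() above.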

For soccer, the action argument to step() is a list of numpy arrays (one per player), and the returned timestep.observation is a list of observation dictionaries (one per player). The timestep.reward is a list of scalar arrays.
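Because actions, observations, and rewards are all per-player lists, episode bookkeeping is usually done per player as well. The following is a minimal sketch of accumulating each player's undiscounted return. run_episode and policy are hypothetical names introduced for illustration, and the sketch assumes that env.action_spec() and timestep.observation list the players in the same order.

```python
import numpy as np


def run_episode(env, policy):
    """Plays one episode and returns each player's undiscounted return.

    `policy` is a hypothetical callable: policy(observation, action_spec)
    returns one action array for a single player.
    """
    action_specs = env.action_spec()  # one spec per player
    timestep = env.reset()            # reward is None on this FIRST step
    returns = np.zeros(len(timestep.observation))

    while not timestep.last():
        # Build the joint action: one array per player, in player order.
        actions = [
            policy(obs, spec)
            for obs, spec in zip(timestep.observation, action_specs)
        ]
        timestep = env.step(actions)
        # Each reward entry is a scalar array; .item() unwraps it to a float.
        returns += [np.asarray(r).item() for r in timestep.reward]
    return returns
```

For example, passing an environment created by soccer.load() together with `lambda obs, spec: np.zeros(spec.shape)` would play an episode in which every player sends zero actions.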

Usage

soccer.load() constructs and returns a composer.Environment, so users rarely instantiate the class directly. They interact with the returned object through the standard reset() / step() interface.

Code Reference

Attribute	Value
Source Location	`dm_control/composer/environment.py`, lines 294--465
Signature (reset)	`def reset(self) -> dm_env.TimeStep`
Signature (step)	`def step(self, action) -> dm_env.TimeStep`
Constructor	`Environment(task, time_limit=float('inf'), random_state=None, n_sub_steps=None, raise_exception_on_physics_error=True, strip_singleton_obs_buffer_dim=False, max_reset_attempts=1, recompile_mjcf_every_episode=True, fixed_initial_state=False, delayed_observation_padding=ObservationPadding.ZERO, legacy_step=True)`
Import	`from dm_control import composer`

I/O Contract

Inputs (step):

Parameter	Type	Description
`action`	`list[np.ndarray]`	One action array per player. Each array's shape must match the player's `walker.action_spec`.

Outputs (step and reset):

Field	Type	Description
`timestep.step_type`	`dm_env.StepType`	`FIRST` after reset, `MID` during play, `LAST` on termination.
`timestep.observation`	`list[OrderedDict]`	Per-player observation dictionaries. Each key maps to a numpy array.
`timestep.reward`	`list[np.ndarray]` or `None`	Per-player scalar rewards. `None` on `FIRST` step.
`timestep.discount`	`np.ndarray` or `None`	Scalar discount factor. `None` on `FIRST` step.
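The output contract in the table above can be checked mechanically, which may be useful in tests for agent code. The helper below is a hedged sketch: check_soccer_timestep is a hypothetical name and not part of dm_control, and it relies only on the first() method and the fields listed in the table.

```python
def check_soccer_timestep(timestep, num_players):
    """Raises AssertionError if `timestep` breaks the tabulated contract."""
    # One observation dictionary per player on every step.
    assert len(timestep.observation) == num_players
    if timestep.first():
        # reset() produces FIRST, which carries no reward or discount.
        assert timestep.reward is None
        assert timestep.discount is None
    else:
        # MID and LAST carry one reward per player and a discount.
        assert len(timestep.reward) == num_players
        assert timestep.discount is not None
```

For a 2v2 game this would be called as check_soccer_timestep(env.reset(), num_players=4) and again on the result of each env.step(actions).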

Inputs (constructor):

Parameter	Type	Description
`task`	`composer.Task`	A Composer task instance (e.g. `soccer.Task` or `soccer.MultiturnTask`).
`time_limit`	`float`	Maximum episode duration in seconds of simulated time. Default `float('inf')`.
`random_state`	`int`, `np.random.RandomState`, or `None`	Random seed or RNG.
`max_reset_attempts`	`int`	Maximum retries on `EpisodeInitializationError`. Default `1`.
`recompile_mjcf_every_episode`	`bool`	If `True`, recompile the MJCF model on every reset. Default `True`.

Usage Examples

from dm_control.locomotion import soccer
import numpy as np

# Create a 2v2 environment.
env = soccer.load(team_size=2, time_limit=45.0)

# Reset returns the first timestep.
timestep = env.reset()
assert timestep.step_type.name == 'FIRST'
assert timestep.reward is None
print(len(timestep.observation))  # 4 (2 home + 2 away)

# Step with random actions.
action_specs = env.action_spec()
actions = [np.random.uniform(s.minimum, s.maximum, s.shape) for s in action_specs]
timestep = env.step(actions)
assert timestep.step_type.name in ('MID', 'LAST')
print(len(timestep.reward))  # 4 per-player rewards

# Run a full episode.
timestep = env.reset()
total_steps = 0
while not timestep.last():  # True once step_type is dm_env.StepType.LAST
    actions = [np.zeros(s.shape) for s in action_specs]
    timestep = env.step(actions)
    total_steps += 1
print(f"Episode lasted {total_steps} steps")

Related Pages

Principle:Google_deepmind_Dm_control_Multi_Agent_Episode_Loop
