Implementation:Google deepmind Dm control Composer Environment For Soccer
| Metadata | |
|---|---|
| Knowledge Sources | dm_control |
| Domains | Multi-Agent Reinforcement Learning, Environment Interfaces |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool for running the multi-agent soccer simulation loop through the composer.Environment class, which wraps a Composer task in a dm_env.Environment-compatible interface. It accepts joint actions and returns per-player observations and rewards, together with a single scalar discount shared by all players.
Description
The composer.Environment class (defined in dm_control/composer/environment.py) is the runtime engine for all Composer-based environments, including multi-agent soccer. Its key responsibilities include:
- reset() -- Optionally recompiles the MJCF model (controlled by recompile_mjcf_every_episode), runs the initialize_episode_mjcf and initialize_episode hooks, resets the physics context, and returns a dm_env.TimeStep with StepType.FIRST.
- step(action) -- Calls before_step (which dispatches per-player actions to walkers), runs n_sub_steps physics substeps (each bracketed by before_substep / after_substep), calls after_step, queries the task for reward/discount/termination, and returns a TimeStep with StepType.MID or StepType.LAST.
- Observation management -- Uses an observation.Updater to buffer and deliver per-player observations at the correct frequency.
- Error handling -- Optionally catches PhysicsError exceptions and terminates the episode gracefully rather than crashing.
- Time limit enforcement -- Terminates the episode when physics.time() exceeds time_limit.
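The hook ordering described for step() can be illustrated with a toy sketch that does not depend on dm_control. RecordingTask and toy_step are hypothetical names invented for this illustration. The real composer.Environment also updates observations, handles physics errors, and checks the time limit, so this is only a picture of the call order, not the library's implementation.

```python
class RecordingTask:
    """Hypothetical stand-in for a Composer task that logs hook calls."""

    def __init__(self):
        self.calls = []

    def before_step(self, action):
        self.calls.append("before_step")

    def before_substep(self, action):
        self.calls.append("before_substep")

    def after_substep(self):
        self.calls.append("after_substep")

    def after_step(self):
        self.calls.append("after_step")


def toy_step(task, action, n_sub_steps):
    """Mimics only the hook ordering of one control step (no physics)."""
    task.before_step(action)
    for _ in range(n_sub_steps):
        task.before_substep(action)
        # A real environment would advance the physics by one substep here.
        task.after_substep()
    task.after_step()
    return task.calls


task = RecordingTask()
calls = toy_step(task, action=[0.0], n_sub_steps=2)
print(calls)
```

With n_sub_steps=2, the before_substep / after_substep pair appears twice between a single before_step and a single after_step.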
For soccer, the action argument to step() is a list of numpy arrays (one per player), and the returned timestep.observation is a list of observation dictionaries (one per player). The timestep.reward is a list of scalar arrays.
Usage
The composer.Environment is instantiated automatically by soccer.load(). Users interact with it through the standard reset() / step() interface.
Code Reference
| Attribute | Value |
|---|---|
| Source Location | dm_control/composer/environment.py, lines 294--465 |
| Signature (reset) | def reset(self) -> dm_env.TimeStep |
| Signature (step) | def step(self, action) -> dm_env.TimeStep |
| Constructor | Environment(task, time_limit=float('inf'), random_state=None, n_sub_steps=None, raise_exception_on_physics_error=True, strip_singleton_obs_buffer_dim=False, max_reset_attempts=1, recompile_mjcf_every_episode=True, fixed_initial_state=False, delayed_observation_padding=ObservationPadding.ZERO, legacy_step=True) |
| Import | from dm_control import composer |
I/O Contract
Inputs (step):
| Parameter | Type | Description |
|---|---|---|
| action | list[np.ndarray] | One action array per player. Each array's shape must match the player's walker.action_spec. |
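A minimal sketch of building such a joint action follows. The helper sample_joint_action is a hypothetical name, and the SimpleNamespace objects are stand-ins for the per-player specs that env.action_spec() would return; the sketch assumes only that each spec exposes shape, minimum, and maximum, and the action dimension of 3 is arbitrary.

```python
from types import SimpleNamespace

import numpy as np


def sample_joint_action(action_specs, rng):
    """Returns one uniformly sampled action array per player spec."""
    return [
        rng.uniform(spec.minimum, spec.maximum, size=spec.shape)
        for spec in action_specs
    ]


# Hypothetical stand-ins for the per-player specs from env.action_spec().
fake_specs = [
    SimpleNamespace(shape=(3,), minimum=-np.ones(3), maximum=np.ones(3))
    for _ in range(4)
]
rng = np.random.default_rng(0)
joint_action = sample_joint_action(fake_specs, rng)
```

In a real 2v2 environment the same helper could be called with env.action_spec() in place of fake_specs, and its result passed to env.step().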
Outputs (step and reset):
| Field | Type | Description |
|---|---|---|
| timestep.step_type | dm_env.StepType | FIRST after reset, MID during play, LAST on termination. |
| timestep.observation | list[OrderedDict] | Per-player observation dictionaries. Each key maps to a numpy array. |
| timestep.reward | list[np.ndarray] or None | Per-player scalar rewards. None on FIRST step. |
| timestep.discount | np.ndarray or None | Scalar discount factor. None on FIRST step. |
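Because timestep.reward is None on the FIRST step and a list of scalar arrays afterwards, code that accumulates per-player returns has to handle both cases. The sketch below shows one way to do that. accumulate_returns is a hypothetical helper, and the SimpleNamespace objects only mimic the reward field of real timesteps with made-up values.

```python
from types import SimpleNamespace

import numpy as np


def accumulate_returns(timesteps, num_players):
    """Sums per-player rewards over an episode, skipping the FIRST step."""
    returns = np.zeros(num_players, dtype=np.float64)
    for ts in timesteps:
        if ts.reward is None:  # The FIRST timestep carries no reward.
            continue
        returns += np.asarray(ts.reward, dtype=np.float64).reshape(num_players)
    return returns


# Hypothetical timesteps mimicking a short 1v1 episode (values are made up).
episode = [
    SimpleNamespace(reward=None),
    SimpleNamespace(reward=[np.zeros(()), np.zeros(())]),
    SimpleNamespace(reward=[np.array(1.0), np.array(-1.0)]),
]
episode_returns = accumulate_returns(episode, num_players=2)
print(episode_returns)
```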
Inputs (constructor):
| Parameter | Type | Description |
|---|---|---|
| task | composer.Task | A Composer task instance (e.g. soccer.Task or soccer.MultiturnTask). |
| time_limit | float | Maximum episode duration in seconds. |
| random_state | int, np.random.RandomState, or None | Random seed or RNG. |
| max_reset_attempts | int | Maximum retries on EpisodeInitializationError. Default 1. |
| recompile_mjcf_every_episode | bool | If True, recompile the MJCF model on every reset. Default True. |
Usage Examples
```python
import numpy as np

from dm_control.locomotion import soccer

# Create a 2v2 environment.
env = soccer.load(team_size=2, time_limit=45.0)

# Reset returns the first timestep.
timestep = env.reset()
assert timestep.step_type.name == 'FIRST'
assert timestep.reward is None
print(len(timestep.observation))  # 4 (2 home + 2 away)

# Step with random actions, one array per player.
action_specs = env.action_spec()
actions = [np.random.uniform(s.minimum, s.maximum, s.shape) for s in action_specs]
timestep = env.step(actions)
assert timestep.step_type.name in ('MID', 'LAST')
print(len(timestep.reward))  # 4 per-player rewards

# Run a full episode with zero actions.
timestep = env.reset()
total_steps = 0
while not timestep.last():  # Equivalent to step_type != dm_env.StepType.LAST.
    actions = [np.zeros(s.shape) for s in action_specs]
    timestep = env.step(actions)
    total_steps += 1
print(f"Episode lasted {total_steps} steps")
```