Principle:Google deepmind Dm control Multi Agent Episode Loop

Metadata
Knowledge Sources	dm_control
Domains	Multi-Agent Reinforcement Learning, Environment Interfaces
Last Updated	2026-02-15 00:00 GMT

Overview

The multi-agent episode loop is the principle of extending a standard single-agent environment step cycle to accept a list of per-player actions and return a list of per-player observations and rewards, all driven by a single shared physics simulation.

Description

Standard reinforcement learning environments follow the dm_env.Environment interface, in which step accepts one agent's action and returns one agent's observation and reward. In a multi-agent setting, this interface must be generalised:

Joint action input -- The step method receives a list of action arrays, one per agent. Each action is applied to its corresponding walker before the physics simulation advances.
Per-agent observations -- After the physics step, the environment produces a list of observation dictionaries, one per agent. Each dictionary contains that agent's egocentric view of the world.
Per-agent rewards -- The task's get_reward method returns a list of scalar rewards, one per agent, allowing different agents to receive different signals (e.g. +1 for the scoring team, -1 for the other).
Shared simulation -- All agents exist in the same MuJoCo physics world. A single call to physics.step() advances all bodies simultaneously, ensuring physical consistency.
Hook-based lifecycle -- The environment orchestrates a sequence of hooks (initialize_episode_mjcf, after_compile, initialize_episode, before_step, before_substep, after_substep, after_step) that allow the task and all entities to inject logic at each phase of the simulation loop.

The environment also handles MJCF recompilation between episodes, observation buffering, physics error recovery, and time-limit enforcement.
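
The hook ordering described above can be sketched as a simple call log. This is a minimal illustration rather than dm_control's implementation: the functions run_reset_hooks and run_step_hooks and the log list are hypothetical, and placing after_compile between initialize_episode_mjcf and initialize_episode is an assumption based on the order in which the hooks are listed.

```python
from typing import List


def run_reset_hooks(log: List[str]) -> None:
    """Record the hooks that fire while an episode is being set up (assumed order)."""
    log.append("initialize_episode_mjcf")  # may modify the MJCF model
    log.append("after_compile")            # the model has been (re)compiled
    log.append("initialize_episode")       # set the initial physics state


def run_step_hooks(log: List[str], n_sub_steps: int) -> None:
    """Record the hooks that fire during one control step."""
    log.append("before_step")              # actions are applied here
    for _ in range(n_sub_steps):
        log.append("before_substep")
        log.append("physics.step")         # shared simulation advances
        log.append("after_substep")
    log.append("after_step")


log: List[str] = []
run_reset_hooks(log)
run_step_hooks(log, n_sub_steps=2)
print(log)
```

Each hook is invoked once for the task and once per attached entity, so every walker sees the same phase of the loop at the same point in simulated time.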

Usage

The multi-agent episode loop is the central abstraction for:

Training multi-agent policies with a shared physics simulation.
Running evaluation rollouts with fixed or learned policies.
Collecting trajectory data for offline RL or imitation learning.
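
All three uses share the same rollout pattern: reset, query one policy per player, step with the joint action, and accumulate per-player rewards. The sketch below shows that pattern against a hypothetical stand-in environment (StubTwoPlayerEnv and StubTimeStep are not part of dm_control); it assumes only that the timestep exposes last(), reward, and observation in the style of dm_env.TimeStep.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Observation = Dict[str, int]
Policy = Callable[[Observation], float]


@dataclass
class StubTimeStep:
    """Hypothetical stand-in exposing the fields the rollout loop needs."""
    reward: Optional[List[float]]
    observation: List[Observation]
    is_last: bool

    def last(self) -> bool:
        return self.is_last


class StubTwoPlayerEnv:
    """Hypothetical 2-player env: player 0 earns +1 and player 1 earns -1 per step."""

    def __init__(self, episode_length: int = 3) -> None:
        self._episode_length = episode_length
        self._t = 0

    def _observe(self) -> List[Observation]:
        return [{"player": i, "step": self._t} for i in range(2)]

    def reset(self) -> StubTimeStep:
        self._t = 0
        return StubTimeStep(None, self._observe(), False)

    def step(self, actions: Sequence[float]) -> StubTimeStep:
        if len(actions) != 2:
            raise ValueError("expected one action per player")
        self._t += 1
        return StubTimeStep([1.0, -1.0], self._observe(),
                            self._t >= self._episode_length)


def run_episode(env, policies: Sequence[Policy]) -> Tuple[List[float], list]:
    """Roll out one episode; return per-player returns and the trajectory."""
    timestep = env.reset()
    returns = [0.0] * len(policies)
    trajectory = []
    while not timestep.last():
        observations = timestep.observation
        # One action per player, each computed from that player's observation.
        actions = [policy(obs) for policy, obs in zip(policies, observations)]
        timestep = env.step(actions)
        trajectory.append((observations, actions, timestep.reward))
        returns = [g + r for g, r in zip(returns, timestep.reward)]
    return returns, trajectory


returns, trajectory = run_episode(StubTwoPlayerEnv(), [lambda o: 1.0, lambda o: -1.0])
print(returns, len(trajectory))
```

The stored (observations, actions, rewards) tuples are the kind of record an offline RL or imitation-learning dataset would be built from; swapping the lambdas for learned policies gives an evaluation rollout.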

Theoretical Basis

The multi-agent step can be expressed as:

def step(actions):  # actions: one action array per player
    # Apply each player's action to its own walker before physics advances.
    for player, action in zip(players, actions):
        player.walker.apply_action(physics, action)

    # Advance the shared MuJoCo simulation by one control step.
    for _ in range(n_sub_steps):
        physics.step()

    observations = [observe(player) for player in players]
    rewards = task.get_reward(physics)       # list of per-player scalars
    discount = task.get_discount(physics)
    terminated = task.should_terminate_episode(physics) or (time >= limit)

    # The final step of an episode is marked LAST; every other step is MID.
    step_type = LAST if terminated else MID
    return TimeStep(step_type, rewards, discount, observations)

The environment maintains the dm_env.TimeStep protocol with StepType values FIRST, MID, and LAST. The FIRST timestep is produced by reset() and contains reward=None and discount=None.
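
To make the step cycle and the TimeStep protocol concrete, the following is a self-contained toy version. Everything in it is a stand-in under stated assumptions: ToyPhysics replaces MuJoCo with one 1-D point mass per player, the StepType and TimeStep classes only mirror the shape of their dm_env counterparts, and the reward (each player's own position) and fixed discount of 1.0 are arbitrary choices for illustration.

```python
import enum
from typing import Dict, List, NamedTuple, Optional, Sequence


class StepType(enum.IntEnum):
    FIRST = 0
    MID = 1
    LAST = 2


class TimeStep(NamedTuple):
    step_type: StepType
    reward: Optional[List[float]]
    discount: Optional[float]
    observation: List[Dict[str, object]]


class ToyPhysics:
    """Hypothetical shared world: one 1-D point mass per player."""

    def __init__(self, n_players: int, timestep: float = 0.005) -> None:
        self.timestep = timestep
        self.time = 0.0
        self.positions = [0.0] * n_players
        self.velocities = [0.0] * n_players

    def step(self) -> None:
        # A single call advances every body in the shared world.
        self.positions = [x + v * self.timestep
                          for x, v in zip(self.positions, self.velocities)]
        self.time += self.timestep


class ToyMultiAgentEnv:
    def __init__(self, n_players: int = 2, n_sub_steps: int = 5,
                 time_limit: float = 0.05) -> None:
        self._n_players = n_players
        self._n_sub_steps = n_sub_steps
        self._time_limit = time_limit
        self._physics = ToyPhysics(n_players)

    def _observe(self) -> List[Dict[str, object]]:
        # Toy egocentric view: own position plus offsets to the other players.
        xs = self._physics.positions
        return [{"own_position": x,
                 "offsets_to_others": [y - x for j, y in enumerate(xs) if j != i]}
                for i, x in enumerate(xs)]

    def reset(self) -> TimeStep:
        self._physics = ToyPhysics(self._n_players)
        # FIRST carries no reward and no discount.
        return TimeStep(StepType.FIRST, None, None, self._observe())

    def step(self, actions: Sequence[float]) -> TimeStep:
        if len(actions) != self._n_players:
            raise ValueError("expected one action per player")
        self._physics.velocities = list(actions)   # joint action input
        for _ in range(self._n_sub_steps):
            self._physics.step()                   # shared simulation
        rewards = list(self._physics.positions)    # toy per-player reward
        # Small tolerance guards against floating-point drift in the clock.
        terminated = self._physics.time >= self._time_limit - 1e-9
        step_type = StepType.LAST if terminated else StepType.MID
        return TimeStep(step_type, rewards, 1.0, self._observe())


env = ToyMultiAgentEnv()
first = env.reset()
mid = env.step([1.0, -1.0])
last = env.step([1.0, -1.0])
print(first.step_type.name, mid.step_type.name, last.step_type.name)
```

Because the FIRST timestep has reward=None, code that accumulates returns should start summing from the first timestep returned by step, not from the one returned by reset.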

The ratio of the control timestep to the physics timestep is configurable. By default, the physics timestep is 0.005 s and the control timestep is 0.025 s, yielding 0.025 / 0.005 = 5 physics sub-steps per control step.
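
The sub-step count is the control timestep divided by the physics timestep. A minimal sketch of that arithmetic follows; the helper name n_sub_steps and its tolerance parameter are hypothetical, and rejecting non-integer ratios is an assumption about how such a check might reasonably behave.

```python
def n_sub_steps(control_timestep: float, physics_timestep: float,
                tolerance: float = 1e-8) -> int:
    """Return the number of physics steps per control step.

    Raises ValueError if the control timestep is not (within tolerance)
    a positive integer multiple of the physics timestep.
    """
    if physics_timestep <= 0:
        raise ValueError("physics_timestep must be positive")
    ratio = control_timestep / physics_timestep
    n = round(ratio)  # absorb floating-point error in the division
    if n < 1 or abs(ratio - n) > tolerance:
        raise ValueError(
            f"control_timestep ({control_timestep}) must be an integer "
            f"multiple of physics_timestep ({physics_timestep})")
    return n


print(n_sub_steps(0.025, 0.005))
```

Rounding before comparing matters because timesteps such as 0.025 and 0.005 are not exactly representable in binary floating point, so a plain integer cast of the quotient could be off by one.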

Related Pages

Implementation:Google_deepmind_Dm_control_Composer_Environment_For_Soccer
