Principle:Google deepmind Dm control Multi Agent Episode Loop
| Metadata | |
|---|---|
| Knowledge Sources | dm_control |
| Domains | Multi-Agent Reinforcement Learning, Environment Interfaces |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
The multi-agent episode loop is the principle of extending a standard single-agent environment step cycle to accept a list of per-player actions and return a list of per-player observations and rewards, all driven by a single shared physics simulation.
Description
Standard reinforcement learning environments follow the dm_env.Environment interface, in which each step takes a single action and returns a single observation and a single scalar reward. In a multi-agent setting, this interface must be generalised:
- Joint action input -- The step method receives a list of action arrays, one per agent. Each action is applied to its corresponding walker before the physics simulation advances.
- Per-agent observations -- After the physics step, the environment produces a list of observation dictionaries, one per agent. Each dictionary contains that agent's egocentric view of the world.
- Per-agent rewards -- The task's get_reward method returns a list of scalar rewards, one per agent, allowing different agents to receive different signals (e.g. +1 for the scoring team, -1 for the other).
- Shared simulation -- All agents exist in the same MuJoCo physics world. A single call to physics.step() advances all bodies simultaneously, ensuring physical consistency.
- Hook-based lifecycle -- The environment orchestrates a sequence of hooks (initialize_episode_mjcf, after_compile, initialize_episode, before_step, before_substep, after_substep, after_step) that allow the task and all entities to inject logic at each phase of the simulation loop.
The environment also handles MJCF recompilation between episodes, observation buffering, physics error recovery, and time-limit enforcement.
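The hook ordering listed above can be illustrated with a small trace. This is a minimal sketch, not dm_control code: the functions `reset`, `step` and `trace_episode` are hypothetical names, and each hook is reduced to recording its own name so that only the call order is shown.

```python
from typing import Callable, List

# A hook is modelled here as "something that receives the hook's name".
Hook = Callable[[str], None]


def reset(fire: Hook) -> None:
    """Episode-start hooks, in the order given in the text."""
    fire("initialize_episode_mjcf")  # runs before the model is (re)compiled
    fire("after_compile")            # runs once the compiled physics exists
    fire("initialize_episode")       # runs on the compiled physics state


def step(fire: Hook, n_sub_steps: int) -> None:
    """One control step: per-step hooks wrap the per-substep hooks."""
    fire("before_step")
    for _ in range(n_sub_steps):
        fire("before_substep")
        # The shared physics.step() call would sit here.
        fire("after_substep")
    fire("after_step")


def trace_episode(n_steps: int, n_sub_steps: int) -> List[str]:
    """Return the hook names in firing order for a short episode."""
    calls: List[str] = []
    reset(calls.append)
    for _ in range(n_steps):
        step(calls.append, n_sub_steps)
    return calls


if __name__ == "__main__":
    for name in trace_episode(n_steps=1, n_sub_steps=2):
        print(name)
```

With `n_sub_steps=2`, the two substep hooks fire twice between a single `before_step` and a single `after_step`.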
Usage
The multi-agent episode loop is the central abstraction used whenever:
- Training multi-agent policies with a shared physics simulation.
- Running evaluation rollouts with fixed or learned policies.
- Collecting trajectory data for offline RL or imitation learning.
Theoretical Basis
The multi-agent step can be expressed as:

    function step(actions: list[array]):
        for player_i, action_i in zip(players, actions):
            player_i.walker.apply_action(physics, action_i)
        for sub in 1..n_sub_steps:
            physics.step()  # advance shared MuJoCo simulation
        observations = [observe(player_i) for player_i in players]
        rewards = task.get_reward(physics)  # list of per-player scalars
        discount = task.get_discount(physics)
        terminated = task.should_terminate_episode(physics) or (time >= limit)
        if terminated:
            return TimeStep(LAST, rewards, discount, observations)
        else:
            return TimeStep(MID, rewards, discount, observations)
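To make the pseudocode concrete, here is a runnable toy version. It is a sketch under loud assumptions: there is no MuJoCo and no dm_control. `ToyPhysics` and `ToyMultiAgentEnv` are hypothetical classes in which each player is a point on a line, an action is a velocity, the reward is the negative distance to a goal, the discount is a constant 1.0, and the time limit is counted in control steps. `StepType` and `TimeStep` are local stand-ins that only mirror the shape of their dm_env counterparts.

```python
import enum
from typing import Dict, List, NamedTuple, Optional, Sequence


class StepType(enum.Enum):
    FIRST = 0
    MID = 1
    LAST = 2


class TimeStep(NamedTuple):
    step_type: StepType
    reward: Optional[List[float]]       # one scalar per player
    discount: Optional[float]
    observation: List[Dict[str, float]]  # one dict per player


class ToyPhysics:
    """Shared state for all players: a position and velocity on a line."""

    def __init__(self, num_players: int, timestep: float) -> None:
        self.positions = [0.0] * num_players
        self.velocities = [0.0] * num_players
        self.timestep = timestep

    def step(self) -> None:
        # A single call advances every player at once.
        self.positions = [p + v * self.timestep
                          for p, v in zip(self.positions, self.velocities)]


class ToyMultiAgentEnv:
    def __init__(self, num_players: int = 2, n_sub_steps: int = 5,
                 physics_timestep: float = 0.005, max_steps: int = 10,
                 goal: float = 1.0) -> None:
        self._num_players = num_players
        self._n_sub_steps = n_sub_steps
        self._physics_timestep = physics_timestep
        self._max_steps = max_steps
        self._goal = goal
        self._physics = ToyPhysics(num_players, physics_timestep)
        self._steps = 0

    def _observe(self) -> List[Dict[str, float]]:
        # Each player sees only its own position and its offset to the goal.
        return [{"position": p, "to_goal": self._goal - p}
                for p in self._physics.positions]

    def reset(self) -> TimeStep:
        self._physics = ToyPhysics(self._num_players, self._physics_timestep)
        self._steps = 0
        return TimeStep(StepType.FIRST, None, None, self._observe())

    def step(self, actions: Sequence[float]) -> TimeStep:
        if len(actions) != self._num_players:
            raise ValueError("expected one action per player")
        # Apply each player's action before the simulation advances.
        self._physics.velocities = [float(a) for a in actions]
        for _ in range(self._n_sub_steps):
            self._physics.step()
        self._steps += 1
        observations = self._observe()
        rewards = [-abs(obs["to_goal"]) for obs in observations]
        terminated = self._steps >= self._max_steps
        step_type = StepType.LAST if terminated else StepType.MID
        return TimeStep(step_type, rewards, 1.0, observations)


if __name__ == "__main__":
    env = ToyMultiAgentEnv(num_players=2, max_steps=3)
    timestep = env.reset()
    while timestep.step_type is not StepType.LAST:
        timestep = env.step([1.0, -1.0])  # one action per player
        print(timestep.step_type.name, timestep.reward)
```

The caller always passes and receives lists whose length equals the number of players, while `ToyPhysics.step()` is called once per sub-step regardless of how many players exist.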
The environment maintains the dm_env.TimeStep protocol with StepType values FIRST, MID, and LAST. The FIRST timestep is produced by reset() and contains reward=None and discount=None.
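Because the FIRST timestep carries no reward, code that consumes an episode has to skip it when accumulating per-player returns. The sketch below illustrates this with local stand-ins: `StepType` and `TimeStep` only mirror the shape of the dm_env types, and `per_player_returns` is a hypothetical helper, not part of any library.

```python
import enum
from typing import Iterable, List, NamedTuple, Optional


class StepType(enum.Enum):
    FIRST = 0
    MID = 1
    LAST = 2


class TimeStep(NamedTuple):
    step_type: StepType
    reward: Optional[List[float]]
    discount: Optional[float]
    observation: list


def per_player_returns(timesteps: Iterable[TimeStep],
                       num_players: int) -> List[float]:
    """Sum undiscounted rewards per player over one episode."""
    returns = [0.0] * num_players
    for ts in timesteps:
        if ts.step_type is StepType.FIRST:
            # reset() yields reward=None, so there is nothing to add.
            continue
        if len(ts.reward) != num_players:
            raise ValueError("expected one reward per player")
        for i, reward in enumerate(ts.reward):
            returns[i] += reward
    return returns


if __name__ == "__main__":
    episode = [
        TimeStep(StepType.FIRST, None, None, [{}, {}]),
        TimeStep(StepType.MID, [0.0, 0.0], 1.0, [{}, {}]),
        TimeStep(StepType.LAST, [1.0, -1.0], 0.0, [{}, {}]),
    ]
    print(per_player_returns(episode, num_players=2))
```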
The number of physics sub-steps per control step is configurable through the physics and control timesteps. By default, the physics timestep is 0.005 s and the control timestep is 0.025 s, yielding 0.025 / 0.005 = 5 sub-steps per control step.
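The sub-step count follows from dividing the control timestep by the physics timestep. The sketch below shows that arithmetic with a guard against timesteps that do not divide evenly. `compute_n_sub_steps` and its `tolerance` parameter are hypothetical names chosen for illustration, and the rounding strategy is an assumption rather than the library's documented behaviour.

```python
def compute_n_sub_steps(control_timestep: float, physics_timestep: float,
                        tolerance: float = 1e-8) -> int:
    """Return how many physics steps fit in one control step."""
    if physics_timestep <= 0:
        raise ValueError("physics_timestep must be positive")
    if control_timestep < physics_timestep:
        raise ValueError("control_timestep must be >= physics_timestep")
    ratio = control_timestep / physics_timestep
    nearest = round(ratio)
    # Floating-point division is inexact, so compare against a tolerance
    # instead of requiring an exact integer.
    if abs(ratio - nearest) > tolerance:
        raise ValueError(
            "control_timestep must be an integer multiple of physics_timestep")
    return nearest


if __name__ == "__main__":
    print(compute_n_sub_steps(control_timestep=0.025, physics_timestep=0.005))
```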