Principle:Google deepmind Dm control Per Player Observation Design
| Metadata | |
|---|---|
| Knowledge Sources | dm_control |
| Domains | Multi-Agent Reinforcement Learning, Observation Design |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Per-player observation design is the principle of constructing an independent, egocentric observation vector for each agent in a multi-agent environment so that every agent perceives the world from its own reference frame.
Description
In a multi-agent setting, each agent needs observations that carry enough information to act on and are scoped to that agent's own viewpoint. Per-player observation design specifies the following components:
- Proprioception -- Each agent observes its own joint angles, velocities, and kinematic sensor readings (accelerometer, gyroscope, velocimeter). The previous action is also included.
- Egocentric ball state -- The ball's position, linear velocity, and angular velocity are expressed in the observing agent's body frame. This removes the need for the policy to learn coordinate transforms.
- Egocentric teammate and opponent state -- The position, linear velocity, orientation, and end-effector positions of every other player are transformed into the observing player's reference frame. Teammates and opponents are given distinct prefixed names (e.g. teammate_0, opponent_1).
- Arena landmarks -- Goal posts and field corners are provided as egocentric vectors so the agent can orient itself on the pitch.
- Game statistics -- Derived quantities such as velocity toward ball, closest-teammate velocity toward ball, forward velocity, ball velocity toward the opponent's goal, average teammate distance, and scoring indicators are included for potential reward shaping.
- Interception events -- Optional binary indicators for ball reception and opponent interception events at multiple distance thresholds (5m, 10m, 15m).
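Several of the game statistics above are projections of one vector onto the direction toward a target. The sketch below shows one plausible way to compute them. It is not dm_control's code, and the helper names are hypothetical. "Average teammate distance" is read here as the mean pairwise distance within a team, which is an assumption. All inputs must share one frame, because dot products and distances do not depend on which frame that is. Ball velocity toward the opponent's goal is `velocity_toward(ball_pos, ball_vel, goal_pos)`.

```python
import numpy as np

def unit(v, eps=1e-8):
    """Scale v to unit length; eps keeps a zero vector from dividing by zero."""
    return v / (np.linalg.norm(v) + eps)

def velocity_toward(pos, vel, target):
    """Component of `vel` along the direction from `pos` to `target`."""
    return float(np.dot(vel, unit(target - pos)))

def closest_teammate_velocity_toward_ball(team_pos, team_vel, ball_pos):
    """velocity_toward for whichever team member is nearest the ball."""
    dists = np.linalg.norm(team_pos - ball_pos, axis=1)
    i = int(np.argmin(dists))
    return velocity_toward(team_pos[i], team_vel[i], ball_pos)

def average_teammate_distance(team_pos):
    """Mean pairwise distance between members of one team (assumed definition)."""
    n = len(team_pos)
    if n < 2:
        return 0.0
    pairs = [np.linalg.norm(team_pos[a] - team_pos[b])
             for a in range(n) for b in range(a + 1, n)]
    return float(np.mean(pairs))
```

For example, a player at the origin moving at 1 m/s along +x, with the ball at (2, 0, 0), gets a velocity toward the ball of about 1.0. Moving sideways along +y gives about 0.0.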
The observation system is modular. Each category of observations is attached by its own observable adder, and adders can be composed to include or exclude categories (for example, leaving out the optional interception events).
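The adder idea can be illustrated without MuJoCo. In the sketch below, each adder is a function that writes its own keys into a player's observation dict, and composition simply runs adders in sequence. This is a generic illustration and not dm_control's adder API. The `world` dict of precomputed per-player arrays, the key names, and every function name are hypothetical.

```python
import numpy as np

def core_adder(player, world, obs):
    """Always-on features: own proprioception and egocentric ball position."""
    obs["proprio"] = world["proprio"][player]
    obs["ball_ego_position"] = world["ball_ego"][player]

def stats_adder(player, world, obs):
    """Optional derived statistic: this player's speed toward the ball."""
    obs["vel_to_ball"] = np.array([world["vel_to_ball"][player]])

def compose(*adders):
    """Chain adders into one; each writes its own keys into the same dict."""
    def combined(player, world, obs):
        for adder in adders:
            adder(player, world, obs)
    return combined

def build_observations(num_players, world, adder):
    """Return one independent observation dict per player."""
    observations = []
    for player in range(num_players):
        obs = {}
        adder(player, world, obs)
        observations.append(obs)
    return observations
```

Passing `compose(core_adder)` instead of `compose(core_adder, stats_adder)` drops the statistics category without touching the core features.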
Usage
Per-player observation design is relevant when:
- Training decentralised policies where each agent has a private observation.
- Designing reward shaping signals from observable statistics.
- Extending the observation space with custom features (e.g. communication channels).
Theoretical Basis
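Egocentric observations are a change of reference frame. For a point q in the world frame and a player whose root body has position p and orientation R (the rotation from body coordinates to world coordinates), the egocentric observation is:
$$q_{\text{ego}} = R^{\top}(q_{\text{world}} - p)$$
For direction vectors, the translation term drops out: $d_{\text{ego}} = R^{\top} d_{\text{world}}$.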
MuJoCo implements this via framepos, framelinvel, frameangvel, and framexaxis/frameyaxis/framezaxis sensors with reftype="body" and refname set to the observing player's root body. These sensors are evaluated inside the physics engine at every timestep, so no manual matrix multiplication is needed in the observation code.
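The sketch below checks the position formula numerically with numpy instead of MuJoCo sensors. It assumes R is the body-to-world rotation. In MuJoCo's Python bindings, that would be `data.xmat[body_id]` reshaped to 3x3. For illustration only, R is built from a yaw angle alone. The names `rotation_z`, `ego_position` and `ego_vector` are hypothetical helpers. `ego_vector` only rotates: it does not subtract the observer's own velocity.

```python
import numpy as np

def rotation_z(theta):
    """Body-to-world rotation for a yaw of `theta` radians about world z."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])

def ego_position(q_world, p, R):
    """Point expressed in the player's frame: R^T (q_world - p)."""
    return R.T @ (q_world - p)

def ego_vector(v_world, R):
    """Direction or velocity vector in the player's frame: rotation only."""
    return R.T @ v_world
```

Consider a player at (1, 1, 0) yawed by pi/2, so that its body x-axis points along world +y. A ball at (1, 3, 0) then maps to (2, 0, 0), which is 2 m along the body x-axis. A ball moving along world +x maps to (0, -1, 0) in the body frame.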
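Conceptually, the observation for player i in a game with N total players (so N - 1 other players) is the concatenation below. In practice, the same features may be delivered as named entries rather than as one flat vector.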
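```python
import numpy as np

def player_observation(proprio, ball_ego_pos, ball_ego_vel,
                       teammates_ego, opponents_ego, landmarks_ego, stats):
    """o_i: all of player i's egocentric features as one flat vector."""
    return np.concatenate([
        proprio,                     # joint angles, velocities, prev action
        ball_ego_pos, ball_ego_vel,  # ball in player i's frame
        *teammates_ego,              # one array per teammate j of player i
        *opponents_ego,              # one array per opponent k of player i
        landmarks_ego,               # goals and field corners
        stats,                       # derived game statistics
    ])
```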
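Building o_i requires relabelling, for each player i, the other N - 1 players as teammates and opponents. The sketch below makes several hypothetical choices:

- There are two equal teams, and the first `team_size` ids are on the home side.
- Prefixed names are assigned in id order.
- The helper names are illustrative only.

It also shows that the length of a flat o_i grows linearly with the number of other players. The per-component dimensions passed to `observation_size` are placeholders.

```python
def team_of(player, team_size):
    """Hypothetical layout: ids 0..team_size-1 are home (0), the rest away (1)."""
    return 0 if player < team_size else 1

def relative_roster(i, team_size):
    """Name every other player from player i's point of view."""
    num_players = 2 * team_size
    mine = team_of(i, team_size)
    teammates = [j for j in range(num_players)
                 if j != i and team_of(j, team_size) == mine]
    opponents = [k for k in range(num_players) if team_of(k, team_size) != mine]
    roster = {f"teammate_{n}": j for n, j in enumerate(teammates)}
    roster.update({f"opponent_{n}": k for n, k in enumerate(opponents)})
    return roster

def observation_size(team_size, proprio_dim, ball_dim, other_dim,
                     landmark_dim, stats_dim):
    """Length of a flat o_i: linear in the N - 1 other players."""
    num_others = 2 * team_size - 1
    return (proprio_dim + ball_dim + num_others * other_dim
            + landmark_dim + stats_dim)
```

In a 2v2 game, `relative_roster(3, 2)` returns `{"teammate_0": 2, "opponent_0": 0, "opponent_1": 1}`. Because of the relative naming, "teammate_0" fills the same slot for every player regardless of side, which is one reason a single policy can be shared across players.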