Principle:Google deepmind Dm control Game Rules Configuration

From Leeroopedia
Metadata
Knowledge Sources dm_control
Domains Multi-Agent Reinforcement Learning, Game Design
Last Updated 2026-02-15 00:00 GMT

Overview

Game rules configuration is the principle of encoding scoring logic, reward distribution, episode termination conditions, and ball-reset mechanics into a task object so that multi-agent competition follows well-defined rules.

Description

A multi-agent competitive environment needs a formal specification of what counts as winning, how progress is rewarded, and when an episode ends. Game rules configuration addresses:

  • Scoring detection -- The task monitors arena sensors to detect when a goal has been scored and by which team.
  • Per-agent reward assignment -- Upon a scoring event, every agent receives a signed scalar reward: +1 for the scoring team and -1 for the conceding team. When no goal is scored, all rewards are 0.
  • Episode termination -- The task decides whether to end the episode on the first goal (single-turn) or to reinitialise positions and continue play until a time limit (multi-turn).
  • Out-of-bounds handling -- When the ball leaves the pitch, a throw-in mechanic repositions it slightly inward and resets its velocity.

These rules are encoded declaratively in a task object that the environment loop queries at every timestep.
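The out-of-bounds rule can be illustrated with a small, library-independent sketch. Everything here is hypothetical and is not taken from dm_control: the `Pitch` type, the fixed `inset` distance, and the choice to reset the velocity to zero are assumptions made only to show the shape of a throw-in rule.

```python
from dataclasses import dataclass
from typing import Tuple

Vec2 = Tuple[float, float]


@dataclass(frozen=True)
class Pitch:
    """Hypothetical rectangular pitch centred on the origin."""
    half_length: float  # half-extent along x
    half_width: float   # half-extent along y


def is_out_of_bounds(pitch: Pitch, xy: Vec2) -> bool:
    """True when the ball centre lies outside the pitch rectangle."""
    x, y = xy
    return abs(x) > pitch.half_length or abs(y) > pitch.half_width


def throw_in(pitch: Pitch, xy: Vec2, inset: float = 0.5) -> Tuple[Vec2, Vec2]:
    """Return (new_position, new_velocity) for a ball that left the pitch.

    The ball is clamped to a rectangle `inset` units inside the boundary
    (assumed value) and its velocity is reset to zero (assumed behaviour).
    """
    x, y = xy
    max_x = pitch.half_length - inset
    max_y = pitch.half_width - inset
    new_xy = (min(max(x, -max_x), max_x), min(max(y, -max_y), max_y))
    return new_xy, (0.0, 0.0)
```

A task following this pattern would call `is_out_of_bounds` after each physics step and apply `throw_in` only when it returns true.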

Usage

Game rules configuration is needed whenever:

  • A researcher wants to switch between episodic (terminate-on-goal) and continuing (multi-turn) training regimes.
  • The reward function needs to be inspected or replaced.
  • Custom termination criteria (e.g. maximum score difference) are desired.
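As an illustration of a custom termination criterion, the sketch below ends play once one team leads by a chosen margin. The `ScoreDifferenceRule` class and its method names are hypothetical and do not correspond to any dm_control API; in practice such logic would live inside the task's own termination check.

```python
class ScoreDifferenceRule:
    """Hypothetical rule: stop once the goal difference reaches a margin."""

    def __init__(self, max_difference: int) -> None:
        if max_difference < 1:
            raise ValueError("max_difference must be at least 1")
        self.max_difference = max_difference
        self.score = {"HOME": 0, "AWAY": 0}

    def record_goal(self, team: str) -> None:
        """Register a goal for 'HOME' or 'AWAY'."""
        self.score[team] += 1

    def should_terminate(self) -> bool:
        """True when one team leads by at least max_difference goals."""
        lead = abs(self.score["HOME"] - self.score["AWAY"])
        return lead >= self.max_difference
```

A rule of this kind sits between the two built-in regimes: with a margin of 1 it behaves like terminate-on-goal, and with a very large margin it approaches pure multi-turn play bounded only by the time limit.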

Theoretical Basis

The reward and termination logic implement a team zero-sum structure. Let G_t ∈ {HOME, AWAY, null} be the goal event at timestep t, where null means no goal was scored. The per-player reward for player p on team tau_p is:

r_p(t) =
  +1   if G_t = tau_p          (player's team scored)
  -1   if G_t != null and G_t != tau_p  (opponent scored)
   0   if G_t = null            (no goal)
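The piecewise reward above translates directly into code. The following is a minimal sketch, assuming a hypothetical `Team` enum and a mapping from player names to teams; neither is part of dm_control.

```python
from enum import Enum
from typing import Dict, Optional


class Team(Enum):
    HOME = 0
    AWAY = 1


def player_rewards(goal: Optional[Team],
                   player_teams: Dict[str, Team]) -> Dict[str, float]:
    """Map each player to r_p(t) given the goal event G_t (None = no goal)."""
    rewards = {}
    for player, team in player_teams.items():
        if goal is None:
            rewards[player] = 0.0   # no goal this timestep
        elif goal == team:
            rewards[player] = 1.0   # player's team scored
        else:
            rewards[player] = -1.0  # opponent scored
    return rewards
```

With equal team sizes the rewards sum to zero at every timestep, which is the zero-sum property stated above.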

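The discount factor gamma(t) follows the standard RL convention that a value of 0 marks a terminal transition and a value of 1 lets the episode continue: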

Single-turn (Task):
  gamma(t) = 0  if G_t != null   (episode ends)
  gamma(t) = 1  otherwise

Multi-turn (MultiturnTask):
  gamma(t) = 1  always           (episode never terminates on a goal)

In the multi-turn variant, positions are reinitialised after every goal and ball entity trackers are reset, but the episode clock continues.
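The difference between the two variants can be seen by replaying the same scripted sequence of goal events under each discount rule. This is a toy sketch, not the library's environment loop: `discount` and `play` are hypothetical helpers, teams are plain strings, and running out of scripted events stands in for the time limit.

```python
from typing import Dict, List, Optional, Tuple


def discount(goal: Optional[str], multiturn: bool) -> float:
    """gamma(t): 0 only when a goal ends a single-turn episode."""
    if multiturn:
        return 1.0
    return 0.0 if goal is not None else 1.0


def play(goal_events: List[Optional[str]],
         multiturn: bool) -> Tuple[int, Dict[str, int]]:
    """Step through scripted goal events; return (steps taken, final score)."""
    score = {"HOME": 0, "AWAY": 0}
    steps = 0
    for goal in goal_events:
        steps += 1  # the episode clock advances on every step
        if goal is not None:
            score[goal] += 1
            # In a multi-turn task, positions would be reinitialised here
            # while the clock keeps running.
        if discount(goal, multiturn) == 0.0:
            break  # terminal transition
    return steps, score
```

For the event sequence `[None, "HOME", None, "AWAY"]`, the single-turn rule stops at the second step with one goal recorded, while the multi-turn rule runs all four steps and records both goals.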
