
Principle:Google DeepMind dm_control Composer Environment Assembly

From Leeroopedia
Principle: Composer Environment Assembly
Workflow: Composer_Environment_Building
Domain: Reinforcement_Learning, Composition
Source: dm_control
Last Updated: 2026-02-15 00:00 GMT

Overview

Environment assembly is the process of integrating a task, its entity hierarchy, and the observation system into a single reinforcement learning environment that conforms to the standard agent-environment interface.

Description

The preceding principles -- Entity Definition, Arena Definition, Task Definition, Observable Configuration, and Domain Randomization -- each address one facet of building a simulation-based RL environment. The Composer Environment Assembly principle describes how these facets are wired together into a functioning whole.

An assembled environment must:

  • Compile the MJCF model from the task's root entity (and all attached sub-entities) into a MuJoCo physics simulation.
  • Manage the episode lifecycle: execute the correct sequence of callbacks across all entities and the task during reset and stepping.
  • Drive the observation pipeline: create an Updater for the task's enabled observables, call its reset, prepare_for_next_control_step, update, and get_observation methods at the correct phases.
  • Implement the dm_env interface: expose reset(), step(action), observation_spec(), action_spec(), reward_spec(), and discount_spec() so that standard RL agent loops can interact with the environment without knowing its internal structure.
  • Handle recompilation: when the task or entities modify the MJCF model between episodes (e.g., for domain randomization), the environment must recompile the physics, refresh entity hooks, and reinitialize the observation updater.
  • Support robust resetting: episode initialization may fail (e.g., due to invalid randomized configurations). The environment should retry up to a configurable number of times before propagating the error.
  • Handle physics errors: if the simulation diverges, the environment can optionally catch the error, terminate the episode with zero reward, and allow a new episode to start.
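
The retry behaviour for robust resetting can be sketched in a few lines. The names `EpisodeInitializationError` and `max_reset_attempts` mirror the reset pseudocode later in this article, but `robust_reset` and the flaky initializer below are illustrative toys, not part of dm_control:

```python
import random

class EpisodeInitializationError(Exception):
    """Raised when episode initialization produces an invalid state."""

def robust_reset(initialize_episode, max_reset_attempts=3, rng=None):
    """Retry episode initialization up to max_reset_attempts times."""
    rng = rng or random.Random(0)
    for attempt in range(max_reset_attempts):
        try:
            return initialize_episode(rng)
        except EpisodeInitializationError:
            if attempt == max_reset_attempts - 1:
                raise  # give up: propagate the error after the final attempt

# Toy initializer that fails on its first two calls, as an invalid
# randomized configuration might.
calls = {"n": 0}
def flaky_init(rng):
    calls["n"] += 1
    if calls["n"] < 3:
        raise EpisodeInitializationError("bad randomized configuration")
    return "first_timestep"

print(robust_reset(flaky_init))  # prints first_timestep
```

With fewer than three allowed attempts, the same initializer would exhaust its retries and the error would propagate to the caller.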

The assembly also manages the relationship between the control timestep and physics timestep. The number of physics substeps per control step is derived from the task's physics_steps_per_control_step property.
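
The derivation amounts to checking that the control timestep is an integer multiple of the physics timestep. The helper below is an illustrative sketch of that check, not the dm_control implementation; the function name and tolerance are assumptions:

```python
def compute_substeps(control_timestep, physics_timestep, tolerance=1e-8):
    """Return the number of physics substeps per control step,
    requiring the control timestep to be an (approximate) integer
    multiple of the physics timestep."""
    if control_timestep < physics_timestep:
        raise ValueError("control timestep must be >= physics timestep")
    ratio = control_timestep / physics_timestep
    n_sub_steps = round(ratio)
    if abs(n_sub_steps - ratio) > tolerance:
        raise ValueError("control timestep must be an integer multiple "
                         "of the physics timestep")
    return n_sub_steps

print(compute_substeps(0.05, 0.005))  # prints 10
```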

Usage

Use Environment Assembly as the final step in building a Composer RL environment:

  1. Define entities: Create your robot, props, and other objects as Entity subclasses.
  2. Define an arena: Instantiate or subclass Arena, attach entities.
  3. Define a task: Subclass Task, set the root entity, implement reward and termination logic, enable observables.
  4. Create the environment: Pass the task to composer.Environment(...) along with timing and configuration options.
  5. Interact: Call env.reset() to start an episode and env.step(action) to advance it, receiving dm_env.TimeStep tuples.
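
The interface the assembled environment exposes can be illustrated with a standard-library-only stand-in. The `TimeStep` namedtuple and `CountdownEnvironment` below are toys that mimic the shape of the dm_env interaction loop; they are not the real dm_env or dm_control classes:

```python
from collections import namedtuple

# Minimal stand-in for dm_env.TimeStep (illustrative only).
TimeStep = namedtuple("TimeStep", ["step_type", "reward", "discount", "observation"])
FIRST, MID, LAST = "FIRST", "MID", "LAST"

class CountdownEnvironment:
    """Toy environment exposing reset() and step(action) in the
    dm_env-style shape described above."""

    def __init__(self, time_limit=3):
        self._time_limit = time_limit
        self._t = 0

    def reset(self):
        self._t = 0
        return TimeStep(FIRST, None, None, {"t": self._t})

    def step(self, action):
        self._t += 1
        done = self._t >= self._time_limit
        return TimeStep(LAST if done else MID,
                        1.0,
                        0.0 if done else 1.0,
                        {"t": self._t})

env = CountdownEnvironment()
ts = env.reset()
while ts.step_type != LAST:
    ts = env.step(0)
print(ts.observation)  # prints {'t': 3}
```

A real composer environment follows the same loop shape, with the observation dictionary populated from the task's enabled observables.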

Theoretical Basis

The environment assembly implements the dm_env protocol, which formalizes the agent-environment interaction loop:

timestep = env.reset()                   # TimeStep(FIRST, None, None, obs)
while not timestep.last():
    action = agent.select_action(timestep)
    timestep = env.step(action)           # TimeStep(MID|LAST, reward, discount, obs)

Internally, the step method executes:

step(action):
    hooks.before_step(physics, action, random_state)
    observation_updater.prepare_for_next_control_step()
    for i in range(n_sub_steps):
        hooks.before_substep(physics, action, random_state)
        physics.step()
        hooks.after_substep(physics, random_state)
        if i < n_sub_steps - 1:
            observation_updater.update()
    hooks.after_step(physics, random_state)
    observation_updater.update()
    reward = task.get_reward(physics)
    discount = task.get_discount(physics)
    done = task.should_terminate_episode(physics) or time >= time_limit
    return TimeStep(MID or LAST, reward, discount, observation)
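
The callback ordering above can be exercised with a small recorder. This is toy code whose hook names mirror the pseudocode; it demonstrates only the sequencing, with the observation update skipped after the final substep and performed once after the after_step hooks:

```python
def toy_step(n_sub_steps, log):
    """Record the callback order of one control step."""
    log.append("before_step")
    log.append("prepare_for_next_control_step")
    for i in range(n_sub_steps):
        log.append("before_substep")
        log.append("physics.step")
        log.append("after_substep")
        if i < n_sub_steps - 1:
            log.append("update")  # buffer observations between substeps
    log.append("after_step")
    log.append("update")  # final observation update for this control step

log = []
toy_step(2, log)
print(log)
```

For two substeps this yields one intermediate update and one final update, so observables sampled at substep boundaries see every physics state except the last one twice.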

The reset method executes:

reset():
    for attempt in range(max_reset_attempts):
        try:
            if recompile_mjcf_every_episode:
                hooks.initialize_episode_mjcf(random_state)
                recompile_physics()
                hooks.after_compile(physics, random_state)
            hooks.initialize_episode(physics, random_state)
            observation_updater.reset(physics, random_state)
            return TimeStep(FIRST, None, None, observation)
        except EpisodeInitializationError:
            if attempt == max_reset_attempts - 1: raise

The hooks system optimizes performance by scanning all entity callbacks at compile time and skipping any that are trivial (have empty bodies), avoiding unnecessary function call overhead in environments with many entities.
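
The effect of this filtering can be sketched with a simple override check: keep only callbacks that differ from the base-class no-op. The real system detects empty function bodies rather than overrides, so the sketch below is a stand-in; `Entity`, `Robot`, `Prop`, and `gather_hooks` are illustrative names:

```python
class Entity:
    """Base class whose hooks are no-ops by default."""
    def before_step(self, physics, action, random_state):
        pass

class Robot(Entity):
    def before_step(self, physics, action, random_state):
        physics.append("robot.before_step")  # non-trivial override

class Prop(Entity):
    pass  # inherits the trivial no-op

def gather_hooks(entities, name):
    """Collect only callbacks overridden from the Entity no-op, so
    trivial hooks are never called in the stepping loop."""
    base = getattr(Entity, name)
    return [getattr(entity, name)
            for entity in entities
            if getattr(type(entity), name) is not base]

entities = [Robot(), Prop(), Prop()]
hooks = gather_hooks(entities, "before_step")
print(len(hooks))  # prints 1: only the robot's hook survives
```

For an environment with many passive props, skipping their trivial callbacks removes per-substep function-call overhead that would otherwise scale with the number of entities.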
