Principle:Google_deepmind_Dm_control_Composer_Environment_Assembly
| Attribute | Value |
|---|---|
| Principle | Composer Environment Assembly |
| Workflow | Composer_Environment_Building |
| Domain | Reinforcement_Learning, Composition |
| Source | dm_control |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Environment assembly is the process of integrating a task, its entity hierarchy, and the observation system into a single reinforcement learning environment that conforms to the standard agent-environment interface.
Description
The preceding principles -- Entity Definition, Arena Definition, Task Definition, Observable Configuration, and Domain Randomization -- each address one facet of building a simulation-based RL environment. The Composer Environment Assembly principle describes how these facets are wired together into a functioning whole.
An assembled environment must:
- Compile the MJCF model from the task's root entity (and all attached sub-entities) into a MuJoCo physics simulation.
- Manage the episode lifecycle: execute the correct sequence of callbacks across all entities and the task during reset and stepping.
- Drive the observation pipeline: create an `Updater` for the task's enabled observables, calling its `reset`, `prepare_for_next_control_step`, `update`, and `get_observation` methods at the correct phases.
- Implement the dm_env interface: expose `reset()`, `step(action)`, `observation_spec()`, `action_spec()`, `reward_spec()`, and `discount_spec()` so that standard RL agent loops can interact with the environment without knowing its internal structure.
- Handle recompilation: when the task or entities modify the MJCF model between episodes (e.g., for domain randomization), the environment must recompile the physics, refresh entity hooks, and reinitialize the observation updater.
- Support robust resetting: episode initialization may fail (e.g., due to invalid randomized configurations). The environment should retry up to a configurable number of times before propagating the error.
- Handle physics errors: if the simulation diverges, the environment can optionally catch the error, terminate the episode with zero reward, and allow a new episode to start.
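The last requirement can be sketched in plain Python. In dm_control this behavior is toggled by a constructor option on the environment; the `TimeStep`, `PhysicsError`, and `guarded_step` names below are hypothetical stand-ins for the real dm_env/dm_control types, not the library API:

```python
from collections import namedtuple

# Illustrative stand-in for dm_env.TimeStep.
TimeStep = namedtuple('TimeStep', ['step_type', 'reward', 'discount', 'observation'])

class PhysicsError(Exception):
    """Simulation divergence (stand-in for dm_control's own error type)."""

def guarded_step(simulate_substeps, observation, raise_exception=False):
    """Run the physics substeps; on divergence, end the episode gracefully."""
    try:
        simulate_substeps()
    except PhysicsError:
        if raise_exception:
            raise
        # Terminate with zero reward and zero discount so the agent loop
        # sees a normal episode boundary and can start a fresh episode.
        return TimeStep('LAST', 0.0, 0.0, observation)
    return TimeStep('MID', 1.0, 1.0, observation)

def diverging_substeps():
    raise PhysicsError('simulation diverged')

ts = guarded_step(diverging_substeps, observation={})
print(ts.step_type, ts.reward)  # LAST 0.0
```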
The assembly also manages the relationship between the control timestep and the physics timestep. The number of physics substeps per control step is derived from the task's `physics_steps_per_control_step` property.
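Concretely, the substep count is the ratio of the two timesteps. The numbers below are illustrative defaults, not values mandated by the library:

```python
# Illustrative arithmetic only: how the substep count relates the two timesteps.
control_timestep = 0.05   # seconds between agent actions
physics_timestep = 0.005  # seconds per MuJoCo integration step

# dm_control derives this from the task's physics_steps_per_control_step
# property; here we just compute the ratio directly.
n_sub_steps = round(control_timestep / physics_timestep)
print(n_sub_steps)  # 10
```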
Usage
Use Environment Assembly as the final step in building a Composer RL environment:
- Define entities: Create your robot, props, and other objects as `Entity` subclasses.
- Define an arena: Instantiate or subclass `Arena` and attach entities to it.
- Define a task: Subclass `Task`, set the root entity, implement reward and termination logic, and enable observables.
- Create the environment: Pass the task to `composer.Environment(...)` along with timing and configuration options.
- Interact: Call `env.reset()` to start an episode and `env.step(action)` to advance it, receiving `dm_env.TimeStep` tuples.
Theoretical Basis
The environment assembly implements the dm_env protocol, which formalizes the agent-environment interaction loop:
timestep = env.reset() # TimeStep(FIRST, None, None, obs)
while not timestep.last():
action = agent.select_action(timestep)
timestep = env.step(action) # TimeStep(MID|LAST, reward, discount, obs)
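The shape of this loop can be exercised against a pure-Python mock with no dm_control dependency. `ToyEnv` and the namedtuple `TimeStep` below are illustrative stand-ins (the real `dm_env.TimeStep` also offers `first()`/`mid()`/`last()` helpers):

```python
from collections import namedtuple

TimeStep = namedtuple('TimeStep', ['step_type', 'reward', 'discount', 'observation'])

class ToyEnv:
    """Scripted stand-in for a composer.Environment: fixed-length episodes."""
    def __init__(self, episode_length=3):
        self._episode_length = episode_length
        self._t = 0

    def reset(self):
        self._t = 0
        return TimeStep('FIRST', None, None, {'t': 0})

    def step(self, action):
        self._t += 1
        last = self._t >= self._episode_length
        return TimeStep('LAST' if last else 'MID',
                        1.0,                      # reward
                        0.0 if last else 1.0,     # discount
                        {'t': self._t})

env = ToyEnv()
timestep = env.reset()
total_reward = 0.0
while timestep.step_type != 'LAST':
    action = 0.0  # a real agent would act on timestep.observation
    timestep = env.step(action)
    total_reward += timestep.reward
print(total_reward)  # 3.0
```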
Internally, the step method executes:
step(action):
hooks.before_step(physics, action, random_state)
observation_updater.prepare_for_next_control_step()
for i in range(n_sub_steps):
hooks.before_substep(physics, action, random_state)
physics.step()
hooks.after_substep(physics, random_state)
if i < n_sub_steps - 1:
observation_updater.update()
hooks.after_step(physics, random_state)
observation_updater.update()
reward = task.get_reward(physics)
discount = task.get_discount(physics)
done = task.should_terminate_episode(physics) or time >= time_limit
return TimeStep(MID or LAST, reward, discount, observation)
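The interleaving of hooks and observation updates can be traced with a pure-Python mock that records the call order; all names below are illustrative, and the control flow mirrors the step() pseudocode above:

```python
calls = []

def make_hook(name):
    # Each mock hook just records that it was called.
    return lambda: calls.append(name)

n_sub_steps = 2
before_step = make_hook('before_step')
before_substep = make_hook('before_substep')
physics_step = make_hook('physics.step')
after_substep = make_hook('after_substep')
after_step = make_hook('after_step')
obs_update = make_hook('obs.update')

# Mirror the step() pseudocode: hooks bracket each substep, and the
# observation updater runs between substeps and once after after_step.
before_step()
for i in range(n_sub_steps):
    before_substep()
    physics_step()
    after_substep()
    if i < n_sub_steps - 1:
        obs_update()
after_step()
obs_update()

print(calls)
```

Tracing two substeps makes the asymmetry visible: the final substep's observation update is deferred until after the `after_step` hook, so hooks can still adjust state that observables read.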
The reset method executes:
reset():
for attempt in range(max_reset_attempts):
try:
if recompile_mjcf_every_episode:
hooks.initialize_episode_mjcf(random_state)
recompile_physics()
hooks.after_compile(physics, random_state)
hooks.initialize_episode(physics, random_state)
observation_updater.reset(physics, random_state)
return TimeStep(FIRST, None, None, observation)
except EpisodeInitializationError:
if attempt == max_reset_attempts - 1: raise
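The retry pattern in isolation looks like the following plain-Python sketch; the exception class stands in for dm_control's episode-initialization error, and `flaky_init` simulates a randomized configuration that is only sometimes valid:

```python
class EpisodeInitializationError(Exception):
    """Stand-in for dm_control's episode-initialization error."""

def reset_with_retries(initialize_episode, max_reset_attempts=3):
    """Retry episode initialization; re-raise on the final failed attempt."""
    for attempt in range(max_reset_attempts):
        try:
            return initialize_episode()
        except EpisodeInitializationError:
            if attempt == max_reset_attempts - 1:
                raise

# A randomized initializer that fails twice before succeeding.
attempts = {'n': 0}
def flaky_init():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise EpisodeInitializationError('invalid randomized configuration')
    return 'FIRST timestep'

result = reset_with_retries(flaky_init)
print(result)  # succeeds on the third attempt
```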
The hooks system optimizes performance by scanning all entity callbacks at compile time and skipping any that are trivial (have empty bodies), avoiding unnecessary function call overhead in environments with many entities.
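One way to detect a trivial callback is to compare its bytecode against that of an empty function; this is a sketch of the idea rather than dm_control's exact implementation, and the hook names below are illustrative:

```python
def _empty():
    pass

_EMPTY_BYTECODE = _empty.__code__.co_code

def is_trivial(fn):
    """Heuristic: a function whose bytecode matches an empty body is a no-op."""
    try:
        return fn.__code__.co_code == _EMPTY_BYTECODE
    except AttributeError:
        return False  # builtins, C extensions, etc.: assume non-trivial

def noop_hook(physics, random_state):
    pass  # an entity that doesn't override the callback compiles to this

def real_hook(physics, random_state):
    state = random_state.uniform()  # does actual work
    return state

print(is_trivial(noop_hook), is_trivial(real_hook))
```

Skipping hooks for which `is_trivial` returns true avoids per-substep Python call overhead, which matters when an environment contains dozens of entities that each expose several callbacks.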
Related Pages
- Implementation:Google_deepmind_Dm_control_Composer_Environment
- Principle:Google_deepmind_Dm_control_Entity_Definition
- Principle:Google_deepmind_Dm_control_Arena_Definition
- Principle:Google_deepmind_Dm_control_Task_Definition
- Principle:Google_deepmind_Dm_control_Observable_Configuration
- Principle:Google_deepmind_Dm_control_Domain_Randomization
- Heuristic:Google_deepmind_Dm_control_Physics_Timestep_Configuration
- Heuristic:Google_deepmind_Dm_control_Prop_Settling_Physics_Tuning