Workflow:Danijar Dreamerv3 Distributed Parallel Training

From Leeroopedia
Knowledge Sources
Domains: Reinforcement_Learning, World_Models, Distributed_Training, Model_Based_RL
Last Updated: 2026-02-15 09:00 GMT

Overview

End-to-end process for training a DreamerV3 agent using a distributed multi-process architecture with separate actor, learner, replay, and logger processes communicating via RPC.

Description

This workflow implements DreamerV3 training at scale by decomposing the training pipeline into four independent process types that communicate via Portal RPC. The actor process runs the agent policy to select actions for environment workers. The learner process performs gradient updates on batches sampled from the replay buffer. The replay process manages experience storage, sampling, and rate limiting. The logger process aggregates and writes metrics from all other processes. Each environment runs as its own process, sending observations to the actor and receiving actions back. This architecture eliminates the sequential bottleneck between data collection and learning, enabling higher throughput on multi-core machines and clusters.

Usage

Use this workflow when single-process training becomes a bottleneck. Typical cases are environments with expensive rendering or physics simulation, large models that need significant GPU time per gradient step, and runs that scale to many parallel environments. This mode is also required for multi-machine training, where environment processes, replay, and the learner run on different nodes.

Execution Steps

Step 1: Configuration and Address Setup

Parse command-line arguments with --script parallel and configure the network addresses for inter-process communication. Three addresses are auto-assigned (actor, replay, logger) using free ports on localhost, or can be specified manually for multi-machine setups. All factory functions (agent, environments, replay, streams, logger) are serialized with cloudpickle for transmission to worker processes.

Key considerations:

  • The actor_batch setting controls how many environment observations are batched per policy call
  • Setting agent_process to True runs the agent in its own OS process for memory isolation
  • The remote_envs and remote_replay flags enable launching environment and replay processes on separate machines
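
The automatic address assignment described in this step can be illustrated with a minimal sketch that asks the operating system for free localhost ports. The helper names free_port and assign_addresses and the host:port string format are assumptions made for illustration, not the repository's actual functions or address format.

```python
import socket


def free_port(host="localhost"):
    """Ask the OS for a currently unused TCP port on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # Port 0 lets the OS pick any free port.
        return sock.getsockname()[1]


def assign_addresses(names=("actor", "replay", "logger"), host="localhost"):
    """Map each server name to a hypothetical 'host:port' address string."""
    return {name: f"{host}:{free_port(host)}" for name in names}


addresses = assign_addresses()
for name, address in addresses.items():
    print(name, address)
```

The probing socket is closed before the port is used, so another program could claim the port in between. For multi-machine setups the addresses would be specified manually instead.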

Step 2: Process Spawning

Launch the four process types: one agent process (containing the actor and learner threads), one logger process, one replay process, and one process per environment instance. Environment processes are split between training and evaluation pools. All processes are managed by the Portal framework, which handles lifecycle, error propagation, and cleanup.

Key considerations:

  • The agent process internally spawns actor and learner as threads sharing the same agent object
  • Environment processes are independent OS processes for true parallelism
  • A barrier synchronizes actor and learner to prevent data collection before checkpoint restoration
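
The barrier between actor and learner can be sketched with the standard library alone. This is only an illustration of the ordering guarantee: the function bodies are stand-ins, and the real workflow uses Portal threads rather than threading.Thread.

```python
import threading

barrier = threading.Barrier(2)  # One slot each for the actor and learner threads.
events = []
events_lock = threading.Lock()


def record(message):
    with events_lock:
        events.append(message)


def learner():
    record("learner: checkpoint restored")  # Stand-in for loading agent parameters.
    barrier.wait()
    record("learner: training")


def actor():
    barrier.wait()  # Blocks until the learner has also reached the barrier.
    record("actor: collecting")


threads = [threading.Thread(target=actor), threading.Thread(target=learner)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(events)
```

Because the actor cannot pass the barrier before the learner arrives, the restore message is always recorded first, regardless of which thread starts first.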

Step 3: Replay Process Initialization

The replay process constructs training and evaluation replay buffers, creates data stream iterators, and starts an RPC server. It exposes endpoints for inserting transition batches, sampling training/reporting/evaluation batches, and updating replay metadata. A rate limiter (SamplesPerInsert) enforces the configured train ratio by blocking inserts or samples to maintain the desired balance.

Key considerations:

  • The rate limiter blocks either the actor (inserts) or learner (samples) to enforce the train ratio
  • The replay process has its own checkpoint that saves both replay buffers and the rate limiter state
  • Separate sampling functions serve training, reporting, and evaluation streams
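
The bookkeeping behind a samples-per-insert limit can be sketched as follows. This is a simplified, non-blocking illustration: the class name SamplesPerInsertSketch, its parameters, and the want_insert and want_sample methods are hypothetical, and the limiter described above blocks callers rather than returning booleans.

```python
class SamplesPerInsertSketch:
    """Non-blocking bookkeeping for a samples-per-insert rate limit."""

    def __init__(self, samples_per_insert, tolerance, min_size=1):
        self.samples_per_insert = samples_per_insert
        self.tolerance = tolerance
        self.min_size = min_size
        self.inserts = 0
        self.samples = 0

    def _credit(self):
        # Samples the learner may still draw for the data inserted so far.
        return self.samples_per_insert * self.inserts - self.samples

    def want_insert(self):
        # Allow inserts during warm-up or while the learner is not too far behind.
        return self.inserts < self.min_size or self._credit() <= self.tolerance

    def want_sample(self):
        # Allow samples once enough data exists and credit remains.
        return self.inserts >= self.min_size and self._credit() >= 1

    def insert(self):
        self.inserts += 1

    def sample(self):
        self.samples += 1


limiter = SamplesPerInsertSketch(samples_per_insert=2, tolerance=4)
trace = []
for _ in range(12):
    # Hypothetical scheduler: an eager learner samples whenever it is allowed to,
    # otherwise the actor inserts.
    if limiter.want_sample():
        limiter.sample()
        trace.append("S")
    elif limiter.want_insert():
        limiter.insert()
        trace.append("I")
print("".join(trace), limiter.samples / limiter.inserts)
```

With an eager learner, the trace settles into one insert followed by two samples, which is the configured ratio. If the actor ran ahead instead, want_insert would return False once the unused credit exceeded the tolerance.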

Step 4: Actor Process Loop

The actor process runs a batched RPC server that receives observations from environment processes, batches them, runs the agent policy to select actions, and returns actions to each environment. It maintains per-environment carry states for the recurrent policy. Transitions (observations plus actions and policy outputs) are asynchronously forwarded to the replay process for storage and to the logger process for metric tracking.

Key considerations:

  • The actor batches observations from multiple environments for efficient GPU policy inference
  • Per-environment recurrent state is tracked in a dictionary keyed by environment ID
  • Backpressure is managed via maximum in-flight request limits on RPC clients
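
Batching and per-environment carry tracking can be sketched without any RPC machinery. The ActorSketch class, its submit method, and the toy policy below are hypothetical stand-ins; the real actor receives observations through a batched RPC server and runs the agent policy on the GPU.

```python
def policy(carries, observations):
    """Toy stand-in policy: the carry counts steps, the action is obs + count."""
    new_carries = [carry + 1 for carry in carries]
    actions = [obs + carry for obs, carry in zip(observations, new_carries)]
    return new_carries, actions


class ActorSketch:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.carries = {}  # Recurrent state keyed by environment ID.
        self.pending = []  # Queued (env_id, obs, is_first) requests.

    def submit(self, env_id, obs, is_first=False):
        """Queue one observation; return {env_id: action} once a batch is full."""
        self.pending.append((env_id, obs, is_first))
        if len(self.pending) < self.batch_size:
            return None
        batch, self.pending = self.pending, []
        env_ids = [item[0] for item in batch]
        observations = [item[1] for item in batch]
        # A new episode starts from a fresh carry instead of the stored one.
        carries = [
            0 if is_first else self.carries.get(env, 0)
            for env, _, is_first in batch
        ]
        new_carries, actions = policy(carries, observations)
        self.carries.update(zip(env_ids, new_carries))
        return dict(zip(env_ids, actions))


actor = ActorSketch(batch_size=2)
first = actor.submit(0, 10, is_first=True)   # Batch not full yet.
second = actor.submit(1, 20, is_first=True)  # Batch full: one policy call.
actor.submit(0, 10)
third = actor.submit(1, 20)
print(first, second, third, actor.carries)
```

One policy call serves the whole batch, while the dictionary keeps each environment's recurrent state separate between calls.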

Step 5: Learner Process Loop

The learner process continuously fetches batches from the replay process via RPC, executes agent training steps (world model loss, imagination rollouts, actor-critic updates), and sends any replay metadata updates (e.g., priorities) back. It manages its own checkpoint (saving agent parameters), periodically runs diagnostic reports, and forwards all metrics to the logger process.

Key considerations:

  • The learner runs in a tight loop without explicit coordination with the actor
  • Rate limiting at the replay process indirectly synchronizes learner and actor throughput
  • Reporting and evaluation use separate data streams fetched from the replay process
  • The slow value network target is updated after each training step
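
One common way to maintain a slow target network is an exponential moving average of the online parameters, applied once per training step. The sketch below assumes that form; the function name update_slow, the dictionary parameter layout, and the rate value are illustrative assumptions, not the repository's code or settings.

```python
def update_slow(slow, fast, rate):
    """Move each slow parameter a fraction `rate` toward its fast counterpart."""
    return {key: (1 - rate) * slow[key] + rate * fast[key] for key in slow}


fast_params = {"w": 1.0}  # Stand-in for the online value network parameters.
slow_params = {"w": 0.0}  # Stand-in for the slow target parameters.
history = []
for _ in range(2):
    # A gradient step on fast_params would happen here in a real learner loop.
    slow_params = update_slow(slow_params, fast_params, rate=0.5)
    history.append(slow_params["w"])
print(history)
```

A small rate makes the target change slowly, which is the usual motivation for keeping a separate slow copy of the value network.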

Step 6: Logger Process Aggregation

The logger process runs an RPC server that receives metrics from all other processes (actor, learner, replay). It tracks episode statistics (scores, lengths) from environment transitions, aggregates training and system metrics, and periodically writes everything to the configured logging backends. Each process type contributes its own metrics under distinct prefixes for disambiguation.

Key considerations:

  • Episode statistics are reconstructed from individual transitions received from the actor
  • A timeout mechanism drops episode statistics for environments that stop sending data
  • The logger has its own checkpoint for the global step counter
  • Metrics include per-process FPS, timer breakdowns, system resource usage, and RPC client/server statistics
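
Reconstructing episode statistics from individual transitions, including the timeout for silent environments, can be sketched as below. The EpisodeTracker class, its method names, and the transition fields (reward, is_first, is_last) are assumptions chosen for illustration.

```python
class EpisodeTracker:
    def __init__(self, timeout):
        self.timeout = timeout
        self.active = {}    # env_id -> running score, length, last update time.
        self.finished = []  # Completed (env_id, score, length) tuples.

    def add(self, env_id, reward, is_first, is_last, now):
        """Fold one transition into the running statistics of its environment."""
        if is_first or env_id not in self.active:
            self.active[env_id] = {"score": 0.0, "length": 0, "last_seen": now}
        episode = self.active[env_id]
        episode["score"] += reward
        episode["length"] += 1
        episode["last_seen"] = now
        if is_last:
            self.finished.append((env_id, episode["score"], episode["length"]))
            del self.active[env_id]

    def drop_stale(self, now):
        """Discard partial episodes of environments that stopped sending data."""
        stale = [
            env_id for env_id, episode in self.active.items()
            if now - episode["last_seen"] > self.timeout
        ]
        for env_id in stale:
            del self.active[env_id]
        return stale


tracker = EpisodeTracker(timeout=10)
tracker.add(0, reward=1.0, is_first=True, is_last=False, now=0)
tracker.add(1, reward=5.0, is_first=True, is_last=False, now=1)
tracker.add(0, reward=2.0, is_first=False, is_last=True, now=2)
dropped = tracker.drop_stale(now=20)  # Environment 1 has been silent too long.
print(tracker.finished, dropped)
```

Passing the current time explicitly keeps the sketch deterministic; a real logger would read a clock instead.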

Step 7: Environment Process Loops

Each environment process runs an independent loop: step the environment with the latest action, send the observation to the actor via RPC, receive the next action, and repeat. Environment processes handle disconnections from the actor gracefully by reconnecting and resetting the episode. The first environment process additionally reports its own FPS and resource usage to the logger.

Key considerations:

  • Each environment is a fully independent OS process for maximum parallelism
  • Environment processes automatically reconnect if the actor process restarts
  • Training and evaluation environments are distinguished by their index relative to args.envs
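
The environment loop and its reconnect behavior can be sketched with in-process stand-ins. CountingEnv, FlakyActor, run_env, and is_eval_env are hypothetical names, the actor call replaces the real RPC, and the rule that indices at or above the training environment count belong to the evaluation pool is an assumption about how the index is compared with args.envs.

```python
def is_eval_env(index, num_train_envs):
    """Assumed split: indices below the training count train, the rest evaluate."""
    return index >= num_train_envs


class CountingEnv:
    def reset(self):
        self.t = 0
        return {"obs": 0, "is_first": True}

    def step(self, action):
        self.t += 1
        return {"obs": self.t, "is_first": False}


class FlakyActor:
    """Stand-in for the actor RPC client that fails on one chosen call."""

    def __init__(self, fail_on_call):
        self.fail_on_call = fail_on_call
        self.calls = 0

    def act(self, obs):
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise ConnectionError("actor unavailable")
        return obs["obs"] + 1


def run_env(env, actor, steps):
    """Send observations to the actor; on a lost connection, reset the episode."""
    delivered = []
    obs = env.reset()
    while len(delivered) < steps:
        try:
            action = actor.act(obs)
        except ConnectionError:
            obs = env.reset()  # Start a fresh episode after reconnecting.
            continue
        delivered.append(obs["is_first"])
        obs = env.step(action)
    return delivered


delivered = run_env(CountingEnv(), FlakyActor(fail_on_call=3), steps=4)
print(delivered, [is_eval_env(i, num_train_envs=2) for i in range(3)])
```

After the failed call, the next delivered observation is marked as the first of a new episode, so the actor can start from a fresh recurrent state for that environment.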
