Workflow:Danijar Dreamerv3 Single Process Training

Knowledge Sources	DreamerV3 Mastering Diverse Domains through World Models; DreamerV3 Project
Domains	Reinforcement_Learning, World_Models, Model_Based_RL
Last Updated	2026-02-15 09:00 GMT

Overview

End-to-end process for training a DreamerV3 world-model-based reinforcement learning agent on a target environment using a single process.

Description

This workflow covers the complete single-process training pipeline for DreamerV3, a model-based reinforcement learning algorithm that learns a world model from environment interactions and optimizes a policy through imagined trajectories. The agent encodes sensory inputs into discrete categorical representations via an RSSM (Recurrent State-Space Model), predicts future latent states and rewards, and trains an actor-critic policy entirely within the learned latent dynamics. The process spans configuration loading, environment creation, agent initialization, experience collection via a parallel driver, replay buffer management, and the iterative training loop with periodic logging and checkpointing.

Usage

Execute this workflow when you want to train a DreamerV3 agent from scratch on any supported environment (Atari, DMC, Crafter, DMLab, Minecraft, ProcGen, etc.) using a single machine. This is the default and most common training mode, suitable for environments where data collection and learning can share the same process without bottlenecking each other.

Execution Steps

Step 1: Configuration Loading

Parse command-line arguments and load the hierarchical YAML configuration. The system reads default hyperparameters from the config file, then overlays any named presets (e.g., atari, crafter, size200m) in the order specified, and finally applies any individual flag overrides. This produces a single merged configuration object controlling all aspects of the run.

Key considerations:

Named config blocks are composable; each block overrides the defaults and any earlier blocks, in the order given
The debug config block reduces all sizes for fast iteration
The logdir path supports a {timestamp} placeholder, so each run can get a unique directory
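
The layering described in this step can be illustrated with a small sketch. It is not the project's config loader: the helper names deep_update and build_config, and every value in BLOCKS, are hypothetical and chosen only to show the order of precedence (defaults, then named blocks in order, then individual flag overrides).

```python
import copy


def deep_update(base, overrides):
    """Return a copy of `base` with `overrides` merged in recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def build_config(blocks, preset_names, flag_overrides):
    """Start from defaults, overlay presets in order, then apply flags."""
    config = blocks["defaults"]
    for name in preset_names:
        config = deep_update(config, blocks[name])
    return deep_update(config, flag_overrides)


# Hypothetical config blocks; the keys and numbers are illustrative only.
BLOCKS = {
    "defaults": {
        "task": "dummy_disc",
        "batch_size": 16,
        "run": {"steps": 1_000_000, "train_ratio": 32.0},
    },
    "crafter": {"task": "crafter_reward", "run": {"train_ratio": 512.0}},
    "debug": {"batch_size": 2, "run": {"steps": 1000}},
}

config = build_config(BLOCKS, ["crafter", "debug"], {"run": {"steps": 500}})
print(config)
```

Because each layer is merged into a copy, the defaults stay untouched and the same blocks can be reused to build several runs.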

Step 2: Environment Construction

Instantiate the target environment based on the task config field. The task string is split into a suite name and task name (e.g., atari_pong becomes suite atari, task pong). The appropriate environment wrapper class is dynamically loaded, suite-specific settings are applied, and a chain of composable wrappers normalizes actions, unifies dtypes, validates spaces, and clips continuous actions.

Key considerations:

Each environment suite has specific rendering and observation settings
Continuous action spaces are automatically normalized to a standard range
A temporary environment instance is created to extract observation and action spaces for agent construction, then closed
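
As a rough illustration of the task parsing and action handling described above, the sketch below splits a task string on its first underscore and maps a normalized continuous action back to an environment's native bounds. The function names are hypothetical, and the normalized range of [-1, 1] is an assumption made for the example rather than something this page specifies.

```python
def split_task(task):
    """Split 'suite_task' on the first underscore only."""
    suite, name = task.split("_", 1)
    return suite, name


def to_native_action(action, low, high):
    """Clip an action to [-1, 1] (assumed range), then rescale to [low, high]."""
    clipped = [max(-1.0, min(1.0, a)) for a in action]
    return [
        lo + (a + 1.0) / 2.0 * (hi - lo)
        for a, lo, hi in zip(clipped, low, high)
    ]


print(split_task("atari_pong"))
print(split_task("dmc_walker_walk"))
print(to_native_action([0.0, 2.0], low=[0.0, -2.0], high=[10.0, 2.0]))
```

Splitting only on the first underscore keeps task names that themselves contain underscores intact.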

Step 3: Agent Initialization

Construct the DreamerV3 agent from the observation and action spaces extracted from the environment. This builds the full neural network architecture: an encoder (CNN for images, MLP for vectors), the RSSM world model (recurrent deterministic state plus discrete stochastic state), a decoder for observation reconstruction, reward and continuation prediction heads, an actor (policy) head, and a value head with a slow-moving target network. The optimizer is configured with adaptive gradient clipping, RMS-based scaling, and optional learning rate scheduling.

Key considerations:

Model size is controlled by named size presets (1M to 400M parameters)
The RSSM uses block-diagonal GRU dynamics with grouped linear layers
Value normalization (percentile-based return normalization) is initialized for stable training
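
The percentile-based return normalization mentioned above can be sketched as follows. This is a simplified illustration, not the agent's implementation: the class name ReturnNormalizer, the 5th and 95th percentiles, the 0.99 decay, and the lower bound of 1 on the scale are assumptions loosely modeled on the scheme described in the DreamerV3 paper.

```python
def percentile(values, q):
    """Linearly interpolated percentile of `values`, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1.0 - frac) + ordered[upper] * frac


class ReturnNormalizer:
    """Tracks a moving average of the return spread between two percentiles."""

    def __init__(self, decay=0.99, low_q=5.0, high_q=95.0, min_scale=1.0):
        self.decay = decay          # assumed smoothing factor
        self.low_q = low_q          # assumed lower percentile
        self.high_q = high_q        # assumed upper percentile
        self.min_scale = min_scale  # avoids amplifying tiny returns
        self.spread = None

    def update(self, returns):
        new = percentile(returns, self.high_q) - percentile(returns, self.low_q)
        if self.spread is None:
            self.spread = new
        else:
            self.spread = self.decay * self.spread + (1.0 - self.decay) * new

    def scale(self):
        return max(self.min_scale, self.spread or 0.0)

    def __call__(self, returns):
        s = self.scale()
        return [r / s for r in returns]


normalizer = ReturnNormalizer()
normalizer.update([float(i) for i in range(101)])
print(normalizer.scale())
```

The lower bound on the scale means small returns are left as they are rather than being scaled up, which is the property that keeps sparse-reward tasks from amplifying noise.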

Step 4: Replay Buffer Setup

Create the experience replay buffer with a specified capacity and sampling strategy. The buffer stores fixed-length sequences of transitions and supports online data, with the most recent transitions overwriting the oldest ones. Sampling can be uniform, prioritized, recency-weighted, or a mixture of these strategies.

Key considerations:

Sequence length is batch_length times consec_train, plus replay_context
Total capacity must exceed batch_size times sequence length
Prioritized replay is incompatible with low-precision (float16) training due to gradient scaling artifacts
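
The sizing rules above reduce to simple arithmetic. The sketch below uses hypothetical helper names and example values (batch_length 64, consec_train 1, replay_context 1, batch_size 16) that are illustrative rather than recommended settings.

```python
def replay_sequence_length(batch_length, consec_train, replay_context):
    """Sequence length stored per replay item."""
    return batch_length * consec_train + replay_context


def check_capacity(capacity, batch_size, sequence_length):
    """Raise if the buffer cannot hold more than one full batch of sequences."""
    needed = batch_size * sequence_length
    if capacity <= needed:
        raise ValueError(
            f"capacity {capacity} must exceed batch_size * length = {needed}"
        )
    return needed


length = replay_sequence_length(batch_length=64, consec_train=1, replay_context=1)
needed = check_capacity(capacity=5_000_000, batch_size=16, sequence_length=length)
print(length, needed)
```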

Step 5: Checkpoint Restoration

Initialize or restore training state from a checkpoint. The checkpoint system saves and loads the step counter, full agent state (all neural network parameters), and the replay buffer contents. If a from_checkpoint path is specified, a partial load is performed using a regex filter to select which parameters to restore.

Key considerations:

Resuming training requires pointing logdir to the same directory as the original run
Checkpoint incompatibility manifests as a "Too many leaves for PyTreeDef" error, which indicates that the checkpoint does not match the current configuration
Partial checkpoint loading enables transfer learning from pretrained models
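
Regex-filtered partial loading can be pictured with flat parameter dictionaries. The sketch below is only an illustration of the idea: partial_load is a hypothetical helper, and the parameter names and the pattern are invented for the example rather than taken from the project.

```python
import re


def partial_load(current, checkpoint, pattern):
    """Copy checkpoint entries whose names match `pattern` into `current`."""
    regex = re.compile(pattern)
    restored = dict(current)
    loaded = []
    for name, value in checkpoint.items():
        if not regex.match(name):
            continue
        if name not in current:
            raise KeyError(f"checkpoint parameter {name!r} not in current model")
        restored[name] = value
        loaded.append(name)
    return restored, sorted(loaded)


# Hypothetical parameter names and values.
current_params = {"enc/conv0/kernel": 0.0, "dyn/gru/kernel": 0.0, "pol/out/kernel": 0.0}
pretrained = {"enc/conv0/kernel": 1.5, "dyn/gru/kernel": 2.5, "pol/out/kernel": 3.5}

params, loaded = partial_load(current_params, pretrained, r"(enc|dyn)/.*")
print(loaded)
```

Here the encoder and dynamics entries are restored while the policy entry keeps its fresh initialization, which is the usual shape of a transfer-learning setup.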

Step 6: Data Collection and Training Loop

Run the main training loop that interleaves environment interaction with model updates. A Driver manages parallel environment instances, collecting transitions by running the agent policy and forwarding each step to the replay buffer and logging callbacks. After each group of environment steps, the training function is called at a rate set by the configured train ratio. The ratio counts replayed time steps per environment step (e.g., 32), so the number of gradient updates per environment step is train_ratio divided by batch_size times batch_length. Each update samples a batch from replay, computes the combined world model and actor-critic loss, and updates all parameters.

Key considerations:

The train_ratio sets how many replayed time steps are trained on per environment step; dividing it by batch_size times batch_length gives the number of gradient updates per environment step
Training does not begin until the replay buffer has accumulated at least one full batch
Each training step processes a batch of sequences, computing world model reconstruction loss, RSSM dynamics loss, reward/continuation prediction loss, and actor-critic losses via imagined rollouts
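
The interleaving of environment steps and gradient updates can be sketched with a fractional accumulator. This example assumes, as in the public DreamerV3 code, that train_ratio is measured in replayed time steps per environment step, so it is divided by batch_size times batch_length to obtain updates per environment step. The Ratio class here is a simplified stand-in written for illustration, and the numeric settings are examples.

```python
class Ratio:
    """Accumulates fractional update credit and releases whole updates."""

    def __init__(self, ratio):
        self.ratio = ratio
        self.credit = 0.0

    def __call__(self, env_steps=1):
        self.credit += env_steps * self.ratio
        updates = int(self.credit)
        self.credit -= updates
        return updates


def updates_per_env_step(train_ratio, batch_size, batch_length):
    """Assumption: train_ratio counts replayed time steps per env step."""
    return train_ratio / (batch_size * batch_length)


rate = updates_per_env_step(train_ratio=32, batch_size=16, batch_length=64)
should_train = Ratio(rate)

total_updates = 0
for _ in range(1024):               # simulate 1024 environment steps
    for _ in range(should_train()):
        total_updates += 1          # a gradient update would run here

print(rate, total_updates)
```

With these example settings one update consumes 1024 replayed time steps, so a ratio of 32 works out to one gradient update every 32 environment steps.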

Step 7: Logging and Reporting

Periodically write training metrics, episode statistics, and diagnostic reports. Metrics include training losses, FPS counters, replay buffer statistics, and system resource usage. Reporting runs the agent in evaluation mode on replay data to produce open-loop video predictions, in which the model observes real frames and then imagines forward, to visualize world model quality.

Key considerations:

Log outputs support JSONL files, Scope summaries, TensorBoard, and WandB
Open-loop videos show ground truth, predicted frames, and error maps side by side
The report frequency and log frequency are independently configurable
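
Independent log and report schedules can be sketched with two separate triggers, each tracking its own next firing point. The Every class below is a hypothetical step-based trigger written for illustration; the real project may schedule these by wall-clock time instead. The metric values are placeholders, and an in-memory buffer stands in for a JSONL file.

```python
import io
import json


class Every:
    """Fires when at least `interval` steps have passed since it last fired."""

    def __init__(self, interval):
        self.interval = interval
        self.next_step = 0

    def __call__(self, step):
        if step >= self.next_step:
            self.next_step = step + self.interval
            return True
        return False


should_log = Every(200)      # assumed log interval, in steps
should_report = Every(500)   # assumed report interval, in steps

jsonl = io.StringIO()        # stands in for a metrics.jsonl file
report_steps = []

for step in range(0, 1001, 100):
    if should_log(step):
        # One JSON object per line; the loss value is a placeholder.
        jsonl.write(json.dumps({"step": step, "loss": 1.0 / (1 + step)}) + "\n")
    if should_report(step):
        report_steps.append(step)  # an open-loop video report would run here

lines = jsonl.getvalue().splitlines()
print(len(lines), report_steps)
```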
