Workflow:Danijar Dreamerv3 Single Process Training
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, World_Models, Model_Based_RL |
| Last Updated | 2026-02-15 09:00 GMT |
Overview
End-to-end process for training a DreamerV3 world-model-based reinforcement learning agent on a target environment using a single process.
Description
This workflow covers the complete single-process training pipeline for DreamerV3, a model-based reinforcement learning algorithm that learns a world model from environment interactions and optimizes a policy through imagined trajectories. The agent encodes sensory inputs into discrete categorical representations via an RSSM (Recurrent State-Space Model), predicts future latent states and rewards, and trains an actor-critic policy entirely within the learned latent dynamics. The process spans configuration loading, environment creation, agent initialization, experience collection via a parallel driver, replay buffer management, and the iterative training loop with periodic logging and checkpointing.
Usage
Execute this workflow when you want to train a DreamerV3 agent from scratch on any supported environment (Atari, DMC, Crafter, DMLab, Minecraft, ProcGen, etc.) using a single machine. This is the default and most common training mode, suitable for environments where data collection and learning can share the same process without bottlenecking each other.
Execution Steps
Step 1: Configuration Loading
Parse command-line arguments and load the hierarchical YAML configuration. The system reads default hyperparameters from the config file, then overlays any named presets (e.g., atari, crafter, size200m) in the order specified, and finally applies any individual flag overrides. This produces a single merged configuration object controlling all aspects of the run.
Key considerations:
- Named config blocks are composable and override defaults in order
- The debug config block reduces all sizes for fast iteration
- The logdir supports a timestamp placeholder for unique run directories
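The layering order described above (defaults, then named presets in order, then individual flags) can be sketched in plain Python. This is a minimal illustration and not the repository's config code: `deep_update`, `merge_config`, the nested-dict layout, and all example values are hypothetical.

```python
import copy

def deep_update(base, overrides):
    """Recursively overlay `overrides` onto a copy of `base`."""
    result = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_update(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result

def merge_config(blocks, preset_names, flag_overrides):
    """Apply defaults, then named presets in order, then individual flags."""
    config = copy.deepcopy(blocks["defaults"])
    for name in preset_names:
        config = deep_update(config, blocks[name])
    return deep_update(config, flag_overrides)

# Hypothetical config blocks; a real config file has many more keys.
blocks = {
    "defaults": {"task": "dummy_task", "batch_size": 16, "run": {"train_ratio": 32.0}},
    "atari": {"task": "atari_pong"},
    "debug": {"batch_size": 4},
}
config = merge_config(blocks, ["atari", "debug"], {"run": {"train_ratio": 64.0}})
print(config)
```

Because later layers win, the order in which presets are listed matters: a preset named last overrides any key that an earlier preset also sets.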
Step 2: Environment Construction
Instantiate the target environment based on the task config field. The task string is split at its first underscore into a suite name and a task name (e.g., atari_pong becomes suite atari, task pong). The appropriate environment wrapper class is dynamically loaded, suite-specific settings are applied, and a chain of composable wrappers normalizes actions, unifies dtypes, validates spaces, and clips continuous actions.
Key considerations:
- Each environment suite has specific rendering and observation settings
- Continuous action spaces are automatically normalized to a standard range
- A temporary environment instance is created to extract observation and action spaces for agent construction, then closed
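The task parsing and action normalization above can be illustrated as follows. This is a sketch under assumptions: the helper names are hypothetical, and the standard range is assumed to be [-1, 1], which the text does not state explicitly.

```python
def split_task(task):
    """Split 'suite_task' at the first underscore only."""
    suite, name = task.split("_", 1)
    return suite, name

def unnormalize_action(action, low, high):
    """Clip each component to [-1, 1] (assumed range), then map it to [low, high]."""
    out = []
    for a, lo, hi in zip(action, low, high):
        a = max(-1.0, min(1.0, a))
        out.append(lo + (a + 1.0) / 2.0 * (hi - lo))
    return out

print(split_task("atari_pong"))
print(split_task("dmc_walker_walk"))
print(unnormalize_action([0.0, 2.0], low=[0.0, -3.0], high=[10.0, 3.0]))
```

Splitting only once keeps task names that themselves contain underscores intact, so a string such as dmc_walker_walk yields the suite dmc and the task walker_walk.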
Step 3: Agent Initialization
Construct the DreamerV3 agent from the observation and action spaces extracted in Step 2. This builds the full neural network architecture: an encoder (CNN for images, MLP for vectors), the RSSM world model (recurrent deterministic state plus discrete stochastic state), a decoder for observation reconstruction, reward and continuation prediction heads, an actor (policy) head, and a value head with a slow-moving target network. The optimizer is configured with adaptive gradient clipping, RMS-based scaling, and optional learning rate scheduling.
Key considerations:
- Model size is controlled by named size presets (1M to 400M parameters)
- The RSSM uses block-diagonal GRU dynamics with grouped linear layers
- Value normalization (percentile-based return normalization) is initialized for stable training
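One way to picture percentile-based return normalization is the sketch below. It is not the agent's implementation: the class name, the 5th and 95th percentiles, the decay of 0.99, and the lower bound of 1 on the scale are all assumptions chosen for illustration.

```python
def percentile(values, q):
    """Linearly interpolated percentile, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1.0 - frac) + ordered[upper] * frac

class ReturnNormalizer:
    """Tracks a moving average of return percentiles (hypothetical helper)."""

    def __init__(self, decay=0.99, low_q=5.0, high_q=95.0, min_scale=1.0):
        self.decay = decay
        self.low_q = low_q
        self.high_q = high_q
        self.min_scale = min_scale
        self.low = None
        self.high = None

    def update(self, returns):
        lo = percentile(returns, self.low_q)
        hi = percentile(returns, self.high_q)
        if self.low is None:
            # First batch initializes the running statistics directly.
            self.low, self.high = lo, hi
        else:
            self.low = self.decay * self.low + (1.0 - self.decay) * lo
            self.high = self.decay * self.high + (1.0 - self.decay) * hi

    def scale(self):
        # The floor keeps small returns from being amplified.
        return max(self.min_scale, self.high - self.low)

    def normalize(self, returns):
        s = self.scale()
        return [r / s for r in returns]

norm = ReturnNormalizer()
norm.update([float(i) for i in range(101)])
print(norm.scale())
```

Using percentiles rather than the minimum and maximum makes the scale robust to outlier returns, and the floor on the scale avoids magnifying noise when rewards are sparse.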
Step 4: Replay Buffer Setup
Create the experience replay buffer with a specified capacity and sampling strategy. The buffer stores fixed-length sequences of transitions and accepts data online: once capacity is reached, the most recent transitions overwrite the oldest. Sampling can be uniform, prioritized, recency-weighted, or a mixture of these strategies.
Key considerations:
- Sequence length equals batch_length times consec_train, plus replay_context
- Total capacity must be large enough to hold at least batch_size times sequence length transitions
- Prioritized replay is incompatible with low-precision (float16) training due to gradient scaling artifacts
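The sizing rules above reduce to simple arithmetic, shown here as a sketch. The function names and the numeric values are illustrative assumptions and not defaults taken from the config file.

```python
def replay_sequence_length(batch_length, consec_train, replay_context):
    """Length of each stored sequence, in transitions."""
    return consec_train * batch_length + replay_context

def check_capacity(capacity, batch_size, sequence_length):
    """Return the transitions needed for one batch, or raise if capacity is too small."""
    needed = batch_size * sequence_length
    if capacity < needed:
        raise ValueError(
            f"capacity {capacity} cannot hold one batch of {needed} transitions")
    return needed

# Illustrative values only.
length = replay_sequence_length(batch_length=64, consec_train=1, replay_context=1)
needed = check_capacity(capacity=5_000_000, batch_size=16, sequence_length=length)
print(length, needed)
```

Running a check like this before a long experiment catches an undersized buffer immediately instead of partway through training.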
Step 5: Checkpoint Restoration
Initialize or restore training state from a checkpoint. The checkpoint system saves and loads the step counter, full agent state (all neural network parameters), and the replay buffer contents. If a from_checkpoint path is specified, a partial load is performed using a regex filter to select which parameters to restore.
Key considerations:
- Resuming training requires pointing logdir to the same directory as the original run
- Checkpoint incompatibility manifests as a Too many leaves for PyTreeDef error
- Partial checkpoint loading enables transfer learning from pretrained models
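Regex-filtered partial loading can be sketched with a flat dictionary of parameters. This is an illustration only: `partial_load`, the key names such as enc/conv0/kernel, and the flat-dict layout are hypothetical and do not reflect the actual checkpoint format.

```python
import re

def partial_load(current_params, checkpoint_params, pattern):
    """Copy checkpoint values into current params for keys matching `pattern`."""
    regex = re.compile(pattern)
    merged = dict(current_params)
    restored = []
    for key, value in checkpoint_params.items():
        if regex.match(key):
            if key not in merged:
                raise KeyError(f"checkpoint key {key!r} is missing from the agent")
            merged[key] = value
            restored.append(key)
    return merged, sorted(restored)

# Hypothetical parameter names; values stand in for weight arrays.
fresh = {"enc/conv0/kernel": [0.0], "dyn/gru/kernel": [0.0], "pol/out/kernel": [0.0]}
pretrained = {"enc/conv0/kernel": [1.0], "dyn/gru/kernel": [2.0], "pol/out/kernel": [3.0]}

params, restored = partial_load(fresh, pretrained, r"(enc|dyn)/.*")
print(restored)
print(params["pol/out/kernel"])
```

In this example the encoder and dynamics weights come from the pretrained checkpoint while the policy head keeps its fresh initialization, which is the typical shape of a transfer-learning setup.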
Step 6: Data Collection and Training Loop
Run the main training loop that interleaves environment interaction with model updates. A Driver manages parallel environment instances, collecting transitions by running the agent policy and forwarding each step to the replay buffer and logging callbacks. After each group of environment steps, the training function runs as many gradient updates as the configured train ratio calls for, sampling batches from replay, computing the combined world model and actor-critic loss, and updating all parameters. The ratio counts replayed timesteps per environment step, so a train_ratio of 32 yields 32 / (batch_size times batch_length) gradient updates per environment step.
Key considerations:
- The train_ratio sets how many replayed timesteps are trained on per environment step; together with batch_size and batch_length it determines the number of gradient steps per environment step
- Training does not begin until the replay buffer has accumulated at least one full batch
- Each training step processes a batch of sequences, computing world model reconstruction loss, RSSM dynamics loss, reward/continuation prediction loss, and actor-critic losses via imagined rollouts
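The interleaving of collection and updates can be sketched with a fractional counter. This is a minimal sketch that assumes train_ratio is expressed in replayed timesteps per environment step, so that updates per environment step equal train_ratio / (batch_size times batch_length). The `Ratio` class and `count_updates` function are hypothetical stand-ins, and environment stepping and the gradient update itself are omitted.

```python
class Ratio:
    """Accumulates a fractional number of updates per environment step."""

    def __init__(self, updates_per_step):
        self.updates_per_step = updates_per_step
        self.credit = 0.0

    def __call__(self, env_steps):
        self.credit += env_steps * self.updates_per_step
        due = int(self.credit)
        self.credit -= due
        return due

def count_updates(total_env_steps, envs, train_ratio, batch_size, batch_length):
    """Count gradient updates issued while collecting `total_env_steps` steps."""
    batch_steps = batch_size * batch_length
    should_train = Ratio(train_ratio / batch_steps)
    replay_size = 0
    updates = 0
    for _ in range(0, total_env_steps, envs):
        replay_size += envs  # one transition per parallel environment
        if replay_size < batch_steps:
            continue  # wait until the buffer holds one full batch
        updates += should_train(envs)
    return updates

updates = count_updates(
    total_env_steps=4096, envs=16, train_ratio=32, batch_size=16, batch_length=64)
print(updates)
```

Carrying the fractional remainder forward lets ratios below one update per step work correctly: updates are simply skipped on some iterations until enough credit accumulates.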
Step 7: Logging and Reporting
Periodically write training metrics, episode statistics, and diagnostic reports. Metrics include training losses, FPS counters, replay buffer statistics, and system resource usage. Reporting runs the agent in evaluation mode on replay data to produce open-loop video predictions (observing real frames then imagining forward) which visualize world model quality.
Key considerations:
- Supported log outputs include JSONL files, Scope summaries, TensorBoard, and WandB
- Open-loop videos show ground truth, predicted frames, and error maps side by side
- The report frequency and log frequency are independently configurable
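Independent log and report schedules, together with a JSONL metrics line, can be sketched as follows. The `Every` trigger here is step-based and hypothetical, as are the intervals and metric names; the actual schedules may be driven by wall-clock time instead.

```python
import json

class Every:
    """Fires once each time the step counter crosses a multiple of `interval`."""

    def __init__(self, interval):
        self.interval = interval
        self.last = 0

    def __call__(self, step):
        bucket = step // self.interval
        if bucket > self.last:
            self.last = bucket
            return True
        return False

def jsonl_line(step, metrics):
    """Serialize one metrics record as a single JSON line."""
    return json.dumps({"step": step, **metrics}, sort_keys=True)

should_log, should_report = Every(1000), Every(5000)
events = []
for step in range(0, 10001, 500):
    if should_log(step):
        events.append(("log", step))
    if should_report(step):
        events.append(("report", step))

print(len(events))
print(jsonl_line(1000, {"loss": 0.5}))
```

Keeping the two triggers separate lets cheap scalar metrics be written often while the more expensive open-loop video reports run less frequently.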