Principle:Danijar Dreamerv3 Data Collection And Training

From Leeroopedia
Knowledge Sources
Domains Reinforcement_Learning, Model_Based_RL, World_Models
Last Updated 2026-02-15 09:00 GMT

Overview

The core training loop that interleaves environment data collection with world model learning and policy optimization through latent imagination, implementing the complete DreamerV3 algorithm.

Description

Data Collection and Training is the central computational loop of DreamerV3. It operates in two alternating phases:

Phase 1 — Data Collection: A Driver steps multiple parallel environments using the current policy, collecting observations and inserting transitions into the replay buffer.
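The Driver phase can be sketched as follows. Everything here is an illustrative assumption, not DreamerV3's actual API: `ToyEnv`, the callback protocol, and the class layout are stand-ins showing the pattern of stepping parallel environments and handing each transition to registered callbacks (such as the replay buffer's insert method).

```python
class ToyEnv:
    # Stand-in environment: episodes end after 3 steps, reward equals the action
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return 0.0
    def step(self, action):
        self.t += 1
        return float(self.t), float(action), self.t >= 3

class Driver:
    # Hypothetical sketch: step several envs in lockstep with the current
    # policy and hand every transition to registered callbacks
    def __init__(self, envs):
        self.envs = envs
        self.callbacks = []
        self.obs = [env.reset() for env in envs]
    def on_step(self, fn):
        self.callbacks.append(fn)
    def __call__(self, policy, steps):
        for _ in range(steps):
            for i, env in enumerate(self.envs):
                action = policy(self.obs[i])
                nxt, reward, done = env.step(action)
                for fn in self.callbacks:
                    fn({"obs": self.obs[i], "action": action,
                        "reward": reward, "done": done})
                self.obs[i] = env.reset() if done else nxt

buffer = []
driver = Driver([ToyEnv(), ToyEnv()])
driver.on_step(buffer.append)
driver(policy=lambda obs: 1.0, steps=3)
```

With two environments stepped three times each, six transitions reach the callback.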

Phase 2 — Model Training: Batches are sampled from the replay buffer and processed through the agent's train() method, which executes:

  1. World Model Learning: The RSSM observes real sequences — the encoder maps observations to tokens, the RSSM processes them to produce posterior states, and losses are computed for reconstruction (decoder), reward prediction, continue prediction, and KL divergence between posterior and prior.
  2. Imagination: Starting from observed states, the RSSM imagines future trajectories using the current policy (without environment interaction) for H steps (default 15).
  3. Actor-Critic Optimization: Lambda returns are computed over imagined trajectories. The policy is updated to maximize advantages; the value function is updated to predict returns. A slow EMA target network stabilizes training.
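The slow target network in step 3 can be sketched as an exponential moving average of the online critic's parameters; the decay value and flat-dict parameter layout are illustrative assumptions:

```python
def ema_update(target, online, decay=0.98):
    # Move each target parameter a small step toward the online critic;
    # the slow copy supplies stable bootstrap values during training
    return {k: decay * target[k] + (1 - decay) * online[k] for k in target}

target = {"w": 1.0}
online = {"w": 0.0}
target = ema_update(target, online)  # w moves slightly toward 0.0
```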

The train-to-data ratio is controlled by train_ratio, which sets how many replayed timesteps are trained on per collected environment step, and therefore how many gradient steps occur for a given amount of collected data.
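Under the reading of train_ratio as replayed timesteps per environment step (an assumption; exact semantics depend on the implementation), the number of gradient steps follows from the batch shape:

```python
def gradient_steps(env_steps, train_ratio, batch_size, batch_length):
    # Each gradient step consumes batch_size * batch_length replayed
    # timesteps, so train_ratio * env_steps timesteps yield this many updates
    replayed = env_steps * train_ratio
    return int(replayed // (batch_size * batch_length))

n = gradient_steps(env_steps=1024, train_ratio=32, batch_size=16, batch_length=64)
```

Here 1024 environment steps at a ratio of 32 produce 32768 replayed timesteps, or 32 gradient steps on batches of 16 × 64 timesteps.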

Usage

This is the main computational phase of DreamerV3 training. It begins after configuration, environment construction, agent initialization, replay buffer setup, and checkpoint restoration are complete.

Theoretical Basis

The DreamerV3 training objective combines world model learning and policy optimization:

World Model Loss: $\mathcal{L}_{WM} = \mathcal{L}_{rec} + \beta_{dyn}\,\mathcal{L}_{dyn} + \beta_{rep}\,\mathcal{L}_{rep} + \mathcal{L}_{rew} + \mathcal{L}_{con}$

Where:

  • $\mathcal{L}_{dyn} = \max(\text{KL}[\text{sg}(q) \,\|\, p],\ \text{free\_nats})$
  • $\mathcal{L}_{rep} = \max(\text{KL}[q \,\|\, \text{sg}(p)],\ \text{free\_nats})$
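These two losses can be sketched for categorical latents (DreamerV3 uses discrete latent distributions). The stop-gradient sg(·) only matters under automatic differentiation, so this NumPy sketch shows just the KL computation and the free-nats clipping:

```python
import numpy as np

def kl_categorical(q, p):
    # KL[q || p] summed over the class dimension
    return np.sum(q * (np.log(q) - np.log(p)), axis=-1)

def free_nats_kl(q, p, free_nats=1.0):
    # Clipping from below means KL values already under the threshold
    # contribute a constant, so their gradient vanishes ("free nats")
    return np.maximum(kl_categorical(q, p), free_nats)

q = np.array([0.5, 0.5])
p = np.array([0.5, 0.5])
loss = float(free_nats_kl(q, p))  # KL is 0 here, so the clip returns 1.0
```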

Imagination and Actor-Critic: $V^{\lambda}_t = r_t + \gamma\left((1-\lambda)\,V(s_{t+1}) + \lambda\,V^{\lambda}_{t+1}\right)$
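The recursion runs backward over an imagined trajectory, bootstrapping from the final value estimate. A minimal sketch, with continue flags omitted as in the formula above:

```python
def lambda_returns(rewards, values, gamma=0.997, lam=0.95):
    # values has length T+1: the extra final entry bootstraps the recursion
    T = len(rewards)
    out = [0.0] * T
    next_ret = values[T]
    for t in reversed(range(T)):
        out[t] = rewards[t] + gamma * ((1 - lam) * values[t + 1]
                                       + lam * next_ret)
        next_ret = out[t]
    return out

rets = lambda_returns([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
```

With gamma = lam = 1 and zero values, the lambda return reduces to the undiscounted reward-to-go.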

$\mathcal{L}_{\pi} = -\,\mathbb{E}\left[\sum_t w_t \left(\log \pi(a_t \mid s_t)\,\text{sg}\!\left(\text{norm}(V^{\lambda}_t - V(s_t))\right) + \eta\,\mathrm{H}[\pi]\right)\right]$
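A sketch of the actor loss. Note that DreamerV3's norm(·) scales advantages by a percentile range of the returns; this sketch substitutes a standard-deviation scale clipped at 1 (an assumption for illustration), and leaves stop-gradient to the autodiff framework:

```python
import numpy as np

def actor_loss(log_probs, returns, values, entropies, eta=3e-4):
    adv = np.asarray(returns) - np.asarray(values)
    scale = max(float(np.std(adv)), 1.0)  # stand-in for percentile scaling
    adv = adv / scale  # treated as a constant (stop-gradient) for the actor
    # Negated because optimizers minimize: maximize weighted log-prob + entropy
    return -float(np.mean(np.asarray(log_probs) * adv
                          + eta * np.asarray(entropies)))

loss = actor_loss([1.0, 1.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0], eta=0.0)
```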

Pseudo-code Logic:

# Abstract algorithm
while step < total_steps:
    # Collect data
    driver(policy, steps=10)
    # Train on replay
    for _ in range(train_ratio_steps):
        batch = next(replay_stream)
        carry, outs, metrics = agent.train(carry, batch)
        # agent.train internally:
        #   1. encoder(obs) -> tokens
        #   2. rssm.observe(tokens, actions) -> posterior states
        #   3. compute reconstruction, reward, continue, KL losses
        #   4. rssm.imagine(policy, H steps) -> imagined states
        #   5. compute lambda returns over imagined rewards
        #   6. policy loss = -advantage * log_prob - entropy_bonus
        #   7. value loss = (value - sg(target))^2 + slow_regularization
        #   8. optimizer step on total loss

Related Pages

Implemented By

Uses Heuristics
