Workflow: danijar/dreamerv3 Train and Evaluate
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, World_Models, Model_Based_RL |
| Last Updated | 2026-02-15 09:00 GMT |
Overview
End-to-end process for training a DreamerV3 agent with periodic evaluation episodes on separate environment instances for unbiased performance measurement.
Description
This workflow extends the standard single-process training pipeline with a dedicated evaluation loop. Separate training and evaluation environments run in parallel, each feeding its own replay buffer. The training loop proceeds as normal, but at configurable intervals the system pauses to run complete evaluation episodes with the current policy (without exploration noise), collects episode scores, and generates diagnostic reports from both training and evaluation replay data. This separation ensures that evaluation metrics are not contaminated by exploration behavior or stale replay data.
Usage
Execute this workflow when you need rigorous evaluation metrics during training, such as for benchmark comparisons, hyperparameter sweeps, or paper results. Use this instead of the basic train mode when you want to track evaluation scores on fresh episodes at regular intervals, with evaluation episodes collected by dedicated environment instances that do not contribute to the training replay buffer.
Execution Steps
Step 1: Configuration Loading
Parse command-line arguments with --script train_eval and load the hierarchical YAML configuration. The merged config controls both training and evaluation parameters, including the number of evaluation environments (eval_envs) and evaluation episodes per reporting period (eval_eps).
Key considerations:
- The script field must be set to train_eval to select this mode
- Evaluation environment count and episode count are separate from training settings
- All other config merging (presets, overrides) works identically to single-process training
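The merge behavior described above can be sketched with a minimal recursive dictionary merge. The keys (`script`, `eval_envs`, `eval_eps`) mirror this workflow, but the merge helper and the example defaults are simplified stand-ins, not the real config loader.

```python
def merge(base, override):
    """Recursively merge override into base, returning a new dict."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out

# Hypothetical defaults standing in for the hierarchical YAML config.
defaults = {
    'script': 'train',
    'run': {'envs': 8, 'eval_envs': 4, 'eval_eps': 10},
}

# Command-line overrides select train_eval mode and adjust eval settings;
# untouched keys (here run.envs) survive the merge unchanged.
cli = {'script': 'train_eval', 'run': {'eval_eps': 20}}

config = merge(defaults, cli)
```

Presets and overrides stack the same way: each later layer is merged over the accumulated result, so deeper keys are refined without discarding sibling settings.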
Step 2: Dual Environment Construction
Instantiate separate training and evaluation environment pools. Both pools use the same environment suite and task but may have different internal configurations. The training environments collect data for the training replay buffer, while evaluation environments collect data for a separate evaluation replay buffer.
Key considerations:
- Training and evaluation environments are created from independent factory functions
- The same composable wrapper chain (normalize, unify, check, clip) is applied to both
- Evaluation environments run the same task but their data is kept strictly separate
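The dual-pool construction can be sketched as follows. `Env` and the wrapper functions are toy stand-ins; the point is that the training and evaluation pools come from independent factory calls but share the same composable wrapper chain.

```python
class Env:
    """Toy environment carrying its task, mode, and applied wrappers."""
    def __init__(self, task, mode):
        self.task = task
        self.mode = mode
        self.wrappers = []

def wrapper(name):
    """Build a toy wrapper that records its name on the env."""
    def apply(env):
        env.wrappers.append(name)
        return env
    return apply

# Same composable chain (normalize, unify, check, clip) for both pools.
WRAPPERS = tuple(wrapper(n) for n in ('normalize', 'unify', 'check', 'clip'))

def make_envs(task, mode, count):
    """Factory: build `count` wrapped environments for one pool."""
    pool = []
    for _ in range(count):
        env = Env(task, mode)
        for wrap in WRAPPERS:
            env = wrap(env)
        pool.append(env)
    return pool

# Independent factories; same task, strictly separate data paths.
train_envs = make_envs('dmc_walker_walk', 'train', count=8)
eval_envs = make_envs('dmc_walker_walk', 'eval', count=4)
```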
Step 3: Agent and Dual Replay Initialization
Construct the DreamerV3 agent and two replay buffers: one for training data and one for evaluation data. The evaluation replay buffer is smaller, with one-tenth of the training capacity. Three data streams are created: a training stream, a reporting stream (from training data), and an evaluation stream (from evaluation data).
Key considerations:
- The agent is shared between training and evaluation (same parameters)
- Evaluation replay capacity is automatically set to training capacity divided by 10
- Separate stream iterators are maintained for training, reporting, and evaluation
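A minimal sketch of the dual-replay setup, assuming a toy FIFO buffer in place of the real replay implementation. The capacity rule (eval = train // 10) and the three stream iterators mirror the step above.

```python
from collections import deque

class Replay:
    """Toy FIFO replay buffer with an endless sampling stream."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def stream(self):
        # Endless iterator over stored transitions (cycles in this toy).
        while True:
            for item in list(self.buffer):
                yield item

train_capacity = 1_000_000
train_replay = Replay(train_capacity)
eval_replay = Replay(train_capacity // 10)  # one-tenth of training capacity

train_replay.add({'obs': 'o0'})
eval_replay.add({'obs': 'e0'})

# Three streams: training and reporting both read training data;
# the evaluation stream reads evaluation data only.
train_stream = train_replay.stream()
report_stream = train_replay.stream()
eval_stream = eval_replay.stream()
```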
Step 4: Checkpoint Restoration
Initialize or restore training state, including both replay buffers and the agent. The checkpoint saves the step counter, agent state, and both training and evaluation replay buffer contents.
Key considerations:
- Both replay buffers are saved and restored together
- The should_save clock is registered immediately after loading to avoid redundant saves
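The checkpoint bundling can be sketched as below. The state dictionary (step counter, agent state, both replay buffers) matches the step above; `pickle` and the field names are stand-ins for the real checkpoint format.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, step, agent_state, train_replay, eval_replay):
    """Save step counter, agent state, and both replay buffers together."""
    state = {'step': step, 'agent': agent_state,
             'train_replay': train_replay, 'eval_replay': eval_replay}
    with open(path, 'wb') as f:
        pickle.dump(state, f)

def load_checkpoint(path):
    """Restore the full bundled state from disk."""
    with open(path, 'rb') as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), 'checkpoint.pkl')
save_checkpoint(path, step=5000, agent_state={'params': [0.1, 0.2]},
                train_replay=[{'obs': 1}], eval_replay=[{'obs': 2}])
restored = load_checkpoint(path)
```

Saving both buffers in one checkpoint keeps training and evaluation data consistent with each other on restore, so a resumed run reports from the same data it would have seen without the interruption.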
Step 5: Interleaved Training and Evaluation Loop
Run the main loop that alternates between training steps and evaluation periods. Training proceeds identically to single-process mode: the driver collects environment steps, feeds transitions to the training replay, and triggers gradient updates at the configured train ratio. At each reporting interval, the system runs a set number of complete evaluation episodes using the policy in eval mode, then generates reports from both training and evaluation replay data.
Key considerations:
- Evaluation episodes run the policy in eval mode, which disables exploration noise
- The evaluation driver is reset before each evaluation period to start fresh episodes
- Training and reporting metrics are logged under separate prefixes for disambiguation
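The interleaved structure can be sketched as follows. `policy`, the stubbed train step, and the placeholder episode returns are hypothetical stand-ins; what the sketch shows is the control flow: train continuously, and at each reporting interval reset the evaluation side and run complete episodes in eval mode.

```python
def policy(obs, mode):
    """Toy policy: eval mode stands in for noise-free action selection."""
    return 'greedy' if mode == 'eval' else 'noisy'

def run_eval_episodes(num_episodes):
    """Reset the eval driver (implicit here) and collect fresh episodes."""
    scores = []
    for ep in range(num_episodes):
        policy(obs=None, mode='eval')   # act without exploration noise
        scores.append(100.0 + ep)       # placeholder episode return
    return scores

total_steps = 1000
report_every = 250   # reporting interval
eval_eps = 4         # complete evaluation episodes per period

log = []
for step in range(1, total_steps + 1):
    policy(obs=None, mode='train')      # collect + gradient update (stubbed)
    if step % report_every == 0:
        scores = run_eval_episodes(eval_eps)
        log.append((step, sum(scores) / len(scores)))
```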
Step 6: Metrics Logging and Checkpointing
Write training metrics, evaluation episode statistics, and diagnostic reports. The evaluation episodes produce episode scores and lengths under the epstats prefix. Diagnostic reports are generated from both training replay (report prefix) and evaluation replay (eval prefix), providing open-loop video predictions from both data sources.
Key considerations:
- Evaluation metrics appear under the epstats prefix alongside training episode stats
- Two separate report streams allow comparing world model quality on training vs. evaluation data
- Checkpoints save the full state including both replay buffers
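The prefixing scheme can be sketched with a toy logger. The prefixes (`epstats`, `report`, `eval`) follow the step above; the logger class and metric names are illustrative stand-ins.

```python
class Logger:
    """Toy metrics logger that namespaces values by prefix."""
    def __init__(self):
        self.metrics = {}

    def add(self, values, prefix):
        for name, value in values.items():
            self.metrics[f'{prefix}/{name}'] = value

logger = Logger()
# Evaluation episode statistics land under the epstats prefix.
logger.add({'score': 812.0, 'length': 1000}, prefix='epstats')
# Diagnostic reports come from both replays under distinct prefixes,
# so world model quality on training vs. evaluation data is comparable.
logger.add({'openloop_video_loss': 0.12}, prefix='report')  # training replay
logger.add({'openloop_video_loss': 0.19}, prefix='eval')    # evaluation replay
```

Because the same metric name appears under two prefixes, the two report streams stay distinguishable in downstream dashboards without any renaming.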