Workflow: danijar/dreamerv3 Train and Evaluate
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, World_Models, Model_Based_RL |
| Last Updated | 2026-02-15 09:00 GMT |
Overview
End-to-end process for training a DreamerV3 agent with periodic evaluation episodes on separate environment instances for unbiased performance measurement.
Description
This workflow extends the standard single-process training pipeline with a dedicated evaluation loop. Separate training and evaluation environments run in parallel, each feeding its own replay buffer. The training loop proceeds as normal, but at configurable intervals the system pauses to run complete evaluation episodes with the current policy (without exploration noise), collects episode scores, and generates diagnostic reports from both training and evaluation replay data. This separation ensures that evaluation metrics are not contaminated by exploration behavior or stale replay data.
Usage
Execute this workflow when you need rigorous evaluation metrics during training, such as for benchmark comparisons, hyperparameter sweeps, or paper results. Use this instead of the basic train mode when you want to track evaluation scores on fresh episodes at regular intervals, with evaluation episodes collected by dedicated environment instances that do not contribute to the training replay buffer.
Execution Steps
Step 1: Configuration Loading
Parse command-line arguments with --script train_eval and load the hierarchical YAML configuration. The merged config controls both training and evaluation parameters, including the number of evaluation environments (eval_envs) and evaluation episodes per reporting period (eval_eps).
Key considerations:
- The script field must be set to train_eval to select this mode
- Evaluation environment count and episode count are separate from training settings
- All other config merging (presets, overrides) works identically to single-process training
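The merge behavior described above can be sketched with a minimal recursive dictionary merge. The keys (`script`, `eval_envs`, `eval_eps`) mirror this workflow, but the merge helper and the example defaults are simplified stand-ins, not the real config loader.

```python
def merge(base, override):
    """Recursively merge override into base, returning a new dict."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out

# Hypothetical defaults standing in for the hierarchical YAML config.
defaults = {
    'script': 'train',
    'run': {'envs': 8, 'eval_envs': 4, 'eval_eps': 10},
}

# Command-line overrides select train_eval mode and adjust eval settings;
# untouched keys (here run.envs) survive the merge unchanged.
cli = {'script': 'train_eval', 'run': {'eval_eps': 20}}

config = merge(defaults, cli)
```

Presets and overrides stack the same way: each later layer is merged over the accumulated result, so deeper keys are refined without discarding sibling settings.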
Step 2: Dual Environment Construction
Instantiate separate training and evaluation environment pools. Both pools use the same environment suite and task but may have different internal configurations. The training environments collect data for the training replay buffer, while evaluation environments collect data for a separate evaluation replay buffer.
Key considerations:
- Training and evaluation environments are created from independent factory functions
- The same composable wrapper chain (normalize, unify, check, clip) is applied to both
- Evaluation environments run the same task but their data is kept strictly separate
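The dual-pool construction can be sketched as follows. `Env` and the wrapper functions are toy stand-ins; the point is that the training and evaluation pools come from independent factory calls but share the same composable wrapper chain.

```python
class Env:
    """Toy environment carrying its task, mode, and applied wrappers."""
    def __init__(self, task, mode):
        self.task = task
        self.mode = mode
        self.wrappers = []

def wrapper(name):
    """Build a toy wrapper that records its name on the env."""
    def apply(env):
        env.wrappers.append(name)
        return env
    return apply

# Same composable chain (normalize, unify, check, clip) for both pools.
WRAPPERS = tuple(wrapper(n) for n in ('normalize', 'unify', 'check', 'clip'))

def make_envs(task, mode, count):
    """Factory: build `count` wrapped environments for one pool."""
    pool = []
    for _ in range(count):
        env = Env(task, mode)
        for wrap in WRAPPERS:
            env = wrap(env)
        pool.append(env)
    return pool

# Independent factories; same task, strictly separate data paths.
train_envs = make_envs('dmc_walker_walk', 'train', count=8)
eval_envs = make_envs('dmc_walker_walk', 'eval', count=4)
```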
Step 3: Agent and Dual Replay Initialization
Construct the DreamerV3 agent and two replay buffers: one for training data and one for evaluation data. The evaluation replay buffer is smaller, with one-tenth of the training capacity. Three data streams are created: a training stream, a reporting stream (from training data), and an evaluation stream (from evaluation data).
Key considerations:
- The agent is shared between training and evaluation (same parameters)
- Evaluation replay capacity is automatically set to training capacity divided by 10
- Separate stream iterators are maintained for training, reporting, and evaluation
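A minimal sketch of the dual-replay setup, assuming a toy FIFO buffer in place of the real replay implementation. The capacity rule (eval = train // 10) and the three stream iterators mirror the step above.

```python
from collections import deque

class Replay:
    """Toy FIFO replay buffer with an endless sampling stream."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def stream(self):
        # Endless iterator over stored transitions (cycles in this toy).
        while True:
            for item in list(self.buffer):
                yield item

train_capacity = 1_000_000
train_replay = Replay(train_capacity)
eval_replay = Replay(train_capacity // 10)  # one-tenth of training capacity

train_replay.add({'obs': 'o0'})
eval_replay.add({'obs': 'e0'})

# Three streams: training and reporting both read training data;
# the evaluation stream reads evaluation data only.
train_stream = train_replay.stream()
report_stream = train_replay.stream()
eval_stream = eval_replay.stream()
```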
Step 4: Checkpoint Restoration
Initialize or restore training state, including both replay buffers and the agent. The checkpoint saves the step counter, agent state, and both training and evaluation replay buffer contents.
Key considerations:
- Both replay buffers are saved and restored together
- The should_save clock is registered immediately after loading to avoid redundant saves
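The checkpoint bundling can be sketched as below. The state dictionary (step counter, agent state, both replay buffers) matches the step above; `pickle` and the field names are stand-ins for the real checkpoint format.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, step, agent_state, train_replay, eval_replay):
    """Save step counter, agent state, and both replay buffers together."""
    state = {'step': step, 'agent': agent_state,
             'train_replay': train_replay, 'eval_replay': eval_replay}
    with open(path, 'wb') as f:
        pickle.dump(state, f)

def load_checkpoint(path):
    """Restore the full bundled state from disk."""
    with open(path, 'rb') as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), 'checkpoint.pkl')
save_checkpoint(path, step=5000, agent_state={'params': [0.1, 0.2]},
                train_replay=[{'obs': 1}], eval_replay=[{'obs': 2}])
restored = load_checkpoint(path)
```

Saving both buffers in one checkpoint keeps training and evaluation data consistent with each other on restore, so a resumed run reports from the same data it would have seen without the interruption.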
Step 5: Interleaved Training and Evaluation Loop
Run the main loop that alternates between training steps and evaluation periods. Training proceeds identically to single-process mode: the driver collects environment steps, feeds transitions to the training replay, and triggers gradient updates at the configured train ratio. At each reporting interval, the system runs a set number of complete evaluation episodes using the policy in eval mode, then generates reports from both training and evaluation replay data.
Key considerations:
- Evaluation episodes run the policy in eval mode, which disables exploration noise
- The evaluation driver is reset before each evaluation period to start fresh episodes
- Training and reporting metrics are logged under separate prefixes for disambiguation
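The interleaved structure can be sketched as follows. `policy`, the stubbed train step, and the placeholder episode returns are hypothetical stand-ins; what the sketch shows is the control flow: train continuously, and at each reporting interval reset the evaluation side and run complete episodes in eval mode.

```python
def policy(obs, mode):
    """Toy policy: eval mode stands in for noise-free action selection."""
    return 'greedy' if mode == 'eval' else 'noisy'

def run_eval_episodes(num_episodes):
    """Reset the eval driver (implicit here) and collect fresh episodes."""
    scores = []
    for ep in range(num_episodes):
        policy(obs=None, mode='eval')   # act without exploration noise
        scores.append(100.0 + ep)       # placeholder episode return
    return scores

total_steps = 1000
report_every = 250   # reporting interval
eval_eps = 4         # complete evaluation episodes per period

log = []
for step in range(1, total_steps + 1):
    policy(obs=None, mode='train')      # collect + gradient update (stubbed)
    if step % report_every == 0:
        scores = run_eval_episodes(eval_eps)
        log.append((step, sum(scores) / len(scores)))
```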
Step 6: Metrics Logging and Checkpointing
Write training metrics, evaluation episode statistics, and diagnostic reports. The evaluation episodes produce episode scores and lengths under the epstats prefix. Diagnostic reports are generated from both training replay (report prefix) and evaluation replay (eval prefix), providing open-loop video predictions from both data sources.
Key considerations:
- Evaluation metrics appear under the epstats prefix alongside training episode stats
- Two separate report streams allow comparing world model quality on training vs. evaluation data
- Checkpoints save the full state including both replay buffers
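The prefixing scheme can be sketched with a toy logger. The prefixes (`epstats`, `report`, `eval`) follow the step above; the logger class and metric names are illustrative stand-ins.

```python
class Logger:
    """Toy metrics logger that namespaces values by prefix."""
    def __init__(self):
        self.metrics = {}

    def add(self, values, prefix):
        for name, value in values.items():
            self.metrics[f'{prefix}/{name}'] = value

logger = Logger()
# Evaluation episode statistics land under the epstats prefix.
logger.add({'score': 812.0, 'length': 1000}, prefix='epstats')
# Diagnostic reports come from both replays under distinct prefixes,
# so world model quality on training vs. evaluation data is comparable.
logger.add({'openloop_video_loss': 0.12}, prefix='report')  # training replay
logger.add({'openloop_video_loss': 0.19}, prefix='eval')    # evaluation replay
```

Because the same metric name appears under two prefixes, the two report streams stay distinguishable in downstream dashboards without any renaming.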