Principle:Danijar Dreamerv3 Checkpoint Management
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Training_Infrastructure |
| Last Updated | 2026-02-15 09:00 GMT |
Overview
A persistence mechanism that saves and restores the complete training state (agent parameters, replay buffer, step counter) to enable fault-tolerant training and pretrained model evaluation.
Description
Checkpoint Management in DreamerV3 provides two core operations: save (serialize all registered state to disk) and load_or_save (restore from an existing checkpoint if present, otherwise save the initial state). This enables:
- Fault tolerance: Training can resume from the last checkpoint after crashes or preemption
- Evaluation: Pretrained agents can be loaded for evaluation-only runs
- Transfer learning: Selective loading of parameters from a pretrained checkpoint via regex filtering
The checkpoint system is component-based: each component (agent, replay buffer, step counter) is registered by name on the checkpoint object. The from_checkpoint option allows loading from a different checkpoint path with optional regex-based parameter filtering.
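The component-based design described above can be illustrated with a minimal sketch. This is not the DreamerV3 implementation: `SketchCheckpoint`, `Counter`, the `register` and `exists` methods, and the pickle file format are all hypothetical names and choices, picked only to show how named components with their own save and load hooks combine into a single `load_or_save` call.

```python
import pickle
import tempfile
from pathlib import Path


class Counter:
    """Toy stand-in for a step counter exposing save/load hooks (hypothetical)."""

    def __init__(self, value=0):
        self.value = value

    def save(self):
        return {"value": self.value}

    def load(self, state):
        self.value = state["value"]


class SketchCheckpoint:
    """Hypothetical component-based checkpoint; not the DreamerV3 class."""

    def __init__(self, path):
        self.path = Path(path)
        self.components = {}

    def register(self, name, component):
        # Each component must provide save() -> state and load(state).
        self.components[name] = component

    def exists(self):
        return self.path.exists()

    def save(self):
        state = {name: c.save() for name, c in self.components.items()}
        tmp = self.path.with_name(self.path.name + ".tmp")
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        # Rename over the old file so a crash mid-write keeps the previous checkpoint.
        tmp.replace(self.path)

    def load(self, keys=None):
        with open(self.path, "rb") as f:
            state = pickle.load(f)
        for name in (keys or self.components):
            self.components[name].load(state[name])

    def load_or_save(self):
        # Restore if a checkpoint exists, otherwise write the initial state.
        if self.exists():
            self.load()
        else:
            self.save()


def demo():
    with tempfile.TemporaryDirectory() as tmpdir:
        path = Path(tmpdir) / "ckpt.pkl"
        step = Counter(0)
        ckpt = SketchCheckpoint(path)
        ckpt.register("step", step)
        ckpt.load_or_save()  # no file yet, so the initial state is written
        step.value = 500
        ckpt.save()
        # Simulate a restart with a fresh counter and checkpoint object.
        fresh = Counter(0)
        resumed = SketchCheckpoint(path)
        resumed.register("step", fresh)
        resumed.load_or_save()  # file exists, so state is restored
        return fresh.value
```

Calling `demo()` walks through a first run followed by a simulated restart. Writing to a temporary file and renaming it is one common way to avoid leaving a half-written checkpoint behind; whether DreamerV3 does the same is not stated here.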
Usage
Apply this principle after the agent and replay buffer are initialized but before the training loop begins. During training, checkpoints are saved periodically at the interval set by the save_every config option. For evaluation-only runs, loading a checkpoint is mandatory, because the agent's parameters remain randomly initialized until one is loaded.
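As a rough illustration of periodic saving, the sketch below fires a save callback whenever a fixed number of steps has elapsed. It assumes that save_every is counted in training steps, which the text above does not specify (the actual config may use wall-clock time instead); `EveryNSteps` and `run_training` are hypothetical helpers.

```python
class EveryNSteps:
    """Hypothetical trigger that fires once per `every` elapsed steps."""

    def __init__(self, every):
        self.every = every
        self.last = 0

    def __call__(self, step):
        if step - self.last >= self.every:
            self.last = step
            return True
        return False


def run_training(total_steps, save_every, save_fn):
    """Run a dummy loop and call save_fn(step) whenever the trigger fires."""
    should_save = EveryNSteps(save_every)
    for step in range(1, total_steps + 1):
        # ... one training update would happen here ...
        if should_save(step):
            save_fn(step)


saved_at = []
run_training(total_steps=10, save_every=4, save_fn=saved_at.append)
```

In a real run, `save_fn` would call the checkpoint's save operation rather than record the step number.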
Theoretical Basis
Pseudo-code Logic:
# Abstract algorithm
checkpoint = Checkpoint(path)
checkpoint.register("agent", agent)
checkpoint.register("replay", replay)
checkpoint.register("step", step_counter)

if checkpoint_exists(path):
    checkpoint.load()  # Restore all registered components
else:
    checkpoint.save()  # Save initial state as baseline

# Optionally load from a different pretrained checkpoint:
if from_checkpoint:
    checkpoint.load(from_checkpoint, keys=["agent"], regex=filter_pattern)
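The regex-based filtering used for transfer learning can be sketched as follows. This assumes that parameters are stored in a flat dictionary keyed by name and that a parameter is loaded when the pattern is found anywhere in its name (`re.search`); both are illustrative assumptions, and `filter_params` and the parameter names are hypothetical.

```python
import re


def filter_params(pretrained, current, pattern):
    """Copy into `current` only the pretrained entries whose names match `pattern`.

    Returns the merged parameter dict and the sorted list of loaded names.
    """
    regex = re.compile(pattern)
    merged = dict(current)
    loaded = []
    for name, value in pretrained.items():
        # Skip names missing from the current model or not matching the pattern.
        if name in merged and regex.search(name):
            merged[name] = value
            loaded.append(name)
    return merged, sorted(loaded)


current = {"enc/w": 0.0, "dyn/w": 0.0, "pol/w": 0.0}
pretrained = {"enc/w": 1.0, "dyn/w": 2.0, "pol/w": 3.0}

# Load only the encoder and dynamics parameters; the policy keeps its fresh values.
merged, loaded = filter_params(pretrained, current, r"^(enc|dyn)/")
```

Here the policy parameter keeps its freshly initialized value while the other two are overwritten, which is the effect selective loading is meant to achieve.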