Principle:Danijar Dreamerv3 Random Baseline Agent
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Baseline, Evaluation |
| Last Updated | 2026-02-15 09:00 GMT |
Overview
Strategy of using a uniformly random action policy as a lower-bound baseline for RL evaluation and as a data collection mechanism for initial replay buffer population.
Description
A Random Baseline Agent selects actions uniformly at random from the action space without any learning or state estimation. It serves two key roles in reinforcement learning pipelines:
- Performance baseline: The random agent establishes a reference lower bound on return for a task: any agent that has learned something useful should exceed it. Benchmark normalization (e.g., human-normalized scores) subtracts the random agent score in both the numerator and the denominator, so the random agent maps to 0 and the human reference to 1.
- Initial data collection: Before a learned agent has meaningful parameters, random exploration is used to populate the replay buffer with diverse transitions. This ensures that the first training batches contain varied experiences rather than degenerate sequences.
The random agent must conform to the same interface as the learned agent so it can be substituted transparently in the training pipeline. This includes implementing all lifecycle methods (init_policy, train, report, save, load) as no-ops.
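A minimal sketch of such a drop-in agent follows, assuming a discrete action space. The method names come from the list above, but the signatures, the `num_actions` parameter, and the `is_first` observation key are illustrative assumptions rather than the actual DreamerV3 interface.

```python
import numpy as np


class RandomAgent:
    """Uniform-random agent exposing the same lifecycle methods as a learned agent.

    Signatures are illustrative assumptions, not the actual DreamerV3 interface.
    """

    def __init__(self, num_actions, seed=0):
        self.num_actions = num_actions
        self.rng = np.random.default_rng(seed)

    def init_policy(self, batch_size):
        # No recurrent state to carry: return an empty placeholder.
        return ()

    def policy(self, obs, carry=(), mode="train"):
        # Ignore observation contents; draw one action index per batch entry.
        batch_size = len(obs["is_first"])
        action = self.rng.integers(0, self.num_actions, size=batch_size)
        return {"action": action}, carry

    def train(self, data, carry=()):
        # No parameters to update.
        return {}, carry

    def report(self, data, carry=()):
        # Nothing to log.
        return {}, carry

    def save(self):
        # No weights to checkpoint.
        return None

    def load(self, data=None):
        # Nothing to restore.
        pass


agent = RandomAgent(num_actions=4, seed=0)
carry = agent.init_policy(batch_size=3)
obs = {"is_first": np.array([True, False, False])}
acts, carry = agent.policy(obs, carry)
```

Because every method other than `policy` does nothing, a run loop written against the learned agent can call `train`, `report`, `save`, and `load` on this object without special-casing it.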
Usage
Use this principle when you need a lower-bound baseline for benchmark evaluation or when populating a replay buffer with initial random exploration data before the learned agent begins training. The random agent should be interchangeable with any learned agent in the run loop.
Theoretical Basis
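The random policy is defined as:
$$\pi_{\text{random}}(a \mid s) = \frac{1}{|\mathcal{A}|} \quad \text{for all } a \in \mathcal{A}$$
where $\mathcal{A}$ is the action space. For discrete action spaces, each action is equally likely. For continuous action spaces, $|\mathcal{A}|$ is read as the volume of the bounded space, so the policy samples uniformly within the space bounds.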
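The two cases can be sketched with NumPy as below. The helper names `sample_discrete` and `sample_continuous` are hypothetical, and the continuous case assumes a box-shaped space with finite per-dimension bounds.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_discrete(num_actions, rng):
    """Each of the num_actions indices has probability 1 / num_actions."""
    return int(rng.integers(0, num_actions))


def sample_continuous(low, high, rng):
    """Draw each action dimension independently and uniformly from [low, high)."""
    low = np.asarray(low, dtype=np.float64)
    high = np.asarray(high, dtype=np.float64)
    return rng.uniform(low, high)


discrete_action = sample_discrete(5, rng)
continuous_action = sample_continuous([-1.0, -1.0], [1.0, 1.0], rng)

# Empirical check: action frequencies approach 1/5 as the sample count grows.
counts = np.bincount(rng.integers(0, 5, size=50_000), minlength=5)
frequencies = counts / counts.sum()
```

Neither function looks at the state, which is what makes the policy a fixed reference point that does not depend on training.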
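Expected return under random policy:
The expected return under a random policy provides the normalization baseline:
$$J(\pi_{\text{random}}) = \mathbb{E}_{a_t \sim \pi_{\text{random}}}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]$$
where $r_t$ is the reward at step $t$, $T$ is the episode length, and $\gamma$ is the discount factor ($\gamma = 1$ for undiscounted episode scores).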
This value varies per environment and is typically pre-computed for standard benchmarks (stored in baselines.yaml for DreamerV3).
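As a rough illustration of how such a baseline value can be obtained and used, the sketch below estimates the random policy's undiscounted return by Monte Carlo rollouts on a hypothetical toy task and then applies the usual normalization. The task, the function names, and all scores are placeholders for illustration; none are taken from DreamerV3 or from any benchmark.

```python
import numpy as np


def random_episode_return(rng, num_actions=4, horizon=10):
    """Toy task (hypothetical): reward 1 when action 0 is chosen, else 0."""
    actions = rng.integers(0, num_actions, size=horizon)
    return float(np.sum(actions == 0))


def estimate_random_return(num_episodes=2000, seed=0):
    """Monte Carlo estimate of the expected undiscounted return of the random policy."""
    rng = np.random.default_rng(seed)
    returns = [random_episode_return(rng) for _ in range(num_episodes)]
    return float(np.mean(returns))


def normalized_score(agent_score, random_score, reference_score):
    """Map the random baseline to 0 and the reference (e.g. human) score to 1."""
    return (agent_score - random_score) / (reference_score - random_score)


# The analytic expectation for this toy task is 10 * (1 / 4) = 2.5.
random_score = estimate_random_return()
# Placeholder scores for illustration only, not benchmark results.
score = normalized_score(agent_score=6.25, random_score=2.5, reference_score=10.0)
```

For standard benchmarks the rollouts are not repeated on every run; the pre-computed value is looked up instead.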
Pseudo-code:
# Abstract random agent algorithm
for each step:
    action = uniform_sample(action_space)
    obs, reward, done = env.step(action)
    replay_buffer.add(obs, action, reward, done)
# No learning occurs: all train/report calls are no-ops
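A runnable Python version of the loop above might look like the following sketch. `ToyEnv`, `prefill`, and the `deque`-based buffer are hypothetical stand-ins, not DreamerV3 components. This version stores the observation the action was taken from and resets the environment at episode boundaries, which is one common convention among several.

```python
import collections
import numpy as np


class ToyEnv:
    """Hypothetical stand-in environment: fixed-length episodes, 3 discrete actions."""

    num_actions = 3

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(2, dtype=np.float32)

    def step(self, action):
        self.t += 1
        obs = np.full(2, self.t, dtype=np.float32)
        reward = 1.0 if action == 0 else 0.0
        done = self.t >= self.episode_length
        return obs, reward, done


def prefill(env, replay_buffer, num_steps, seed=0):
    """Fill the buffer with transitions gathered by a uniform-random policy."""
    rng = np.random.default_rng(seed)
    obs = env.reset()
    for _ in range(num_steps):
        action = int(rng.integers(0, env.num_actions))
        next_obs, reward, done = env.step(action)
        replay_buffer.append((obs, action, reward, done))
        # Start a new episode when the current one ends.
        obs = env.reset() if done else next_obs
    return replay_buffer


replay_buffer = prefill(ToyEnv(), collections.deque(maxlen=1000), num_steps=20)
```

Once the buffer holds enough transitions, the learned agent can take over action selection while sampling its first training batches from this randomly collected data.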