Heuristic:Danijar Dreamerv3 Symlog TwoHot Prediction

Knowledge Sources	Mastering Diverse Domains through World Models DreamerV3
Domains	Reinforcement_Learning, Model_Based_RL, Optimization
Last Updated	2026-02-15 09:00 GMT

Overview

Scale-free prediction technique using symlog/symexp transforms with TwoHot discrete regression for reward and value outputs, enabling fixed hyperparameters across diverse reward scales.

Description

DreamerV3 uses a combination of two techniques for scale-invariant prediction of rewards and values:

Symlog/Symexp transforms: The `symlog(x) = sign(x) * log(1 + |x|)` function compresses large magnitudes while preserving signs and zero. Its inverse `symexp(x) = sign(x) * (exp(|x|) - 1)` recovers the original scale. This squashing is applied to regression targets before discretization.

TwoHot discrete regression: Instead of predicting a scalar directly, the network outputs a categorical distribution over a fixed set of bins (default 255). The target is encoded as a soft two-hot vector that splits its weight between the two nearest bins, with the closer bin receiving the larger share (weights fall off linearly with distance to the bin). The prediction is the probability-weighted average of the bin values, computed with a symmetric sum so that a zero-initialized output layer, which yields uniform probabilities, predicts exactly zero.
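The encoding and decoding steps described above can be sketched in a few lines. This is a minimal NumPy illustration, not the repository's implementation: the helper name `twohot_encode`, the bin range of -20 to 20, and the choice to place bins uniformly in symlog space are assumptions made for the example.

```python
import numpy as np

def symlog(x):
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):
    return np.sign(x) * np.expm1(np.abs(x))

def twohot_encode(target, bins):
    """Hypothetical helper: split a scalar target across its two neighbouring bins."""
    target = np.clip(target, bins[0], bins[-1])
    # Index of the right-hand neighbour, clamped so that hi - 1 is always valid.
    hi = int(np.clip(np.searchsorted(bins, target, side="right"), 1, len(bins) - 1))
    lo = hi - 1
    # Linear interpolation: the closer bin receives the larger weight.
    w_hi = (target - bins[lo]) / (bins[hi] - bins[lo])
    probs = np.zeros(len(bins))
    probs[lo] = 1.0 - w_hi
    probs[hi] = w_hi
    return probs

# Assumed setup: 255 bins spaced uniformly in symlog space.
bins = np.linspace(-20.0, 20.0, 255)

reward = 7.0
probs = twohot_encode(symlog(reward), bins)  # squash, then discretize
decoded = symexp((probs * bins).sum())       # weighted average, then unsquash
```

Decoding the two-hot target recovers the original reward up to floating-point error, which is why the encoding adds no discretization bias for targets inside the bin range.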

This combination allows a single set of hyperparameters to work across environments with reward scales differing by orders of magnitude (e.g., Atari scores 0-100 vs Minecraft sparse 0-1).

Usage

Applied automatically to the reward head and value head when configured with `output: symexp_twohot` (the default in configs.yaml). The encoder also uses `symlog` to compress continuous vector observations before feeding them to the MLP encoder.

The Insight (Rule of Thumb)

Action: Use `symexp_twohot` output type for reward and value predictions. Use `symlog` for encoding continuous observations.
Value: Default 255 bins, symlog squashing for targets, symexp unsquashing for predictions.
Trade-off: Slight increase in computation (softmax over 255 classes vs single scalar) in exchange for robust scale-invariant predictions that eliminate the need for reward normalization or clipping.
Compatibility: The TwoHot `pred()` method uses a symmetric sum to avoid accumulated floating-point error. Naive left-to-right summation can produce non-zero predictions even with uniform probabilities over symmetric bins.

Reasoning

Traditional scalar regression is sensitive to reward magnitude. Under an MSE loss, errors on rewards of magnitude 1000 dominate errors on rewards of magnitude 0.01, so the loss scale has to be tuned per environment. Symlog compresses this gap: symlog(1000) ≈ 6.9 and symlog(0.01) ≈ 0.01, so targets five orders of magnitude apart end up less than three orders apart. The discrete TwoHot parameterization avoids the difficulties of heteroscedastic regression while providing a rich multi-modal distribution.
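The compression can be checked numerically. The sketch below uses NumPy rather than JAX, and the sample reward values are chosen only for illustration.

```python
import numpy as np

def symlog(x):
    return np.sign(x) * np.log1p(np.abs(x))

# Illustrative rewards spanning five orders of magnitude, with both signs.
rewards = np.array([-1000.0, -0.01, 0.0, 0.01, 1000.0])
squashed = symlog(rewards)

# Ratio between the largest and smallest positive magnitude, before and after.
raw_ratio = rewards[-1] / rewards[-2]
squashed_ratio = squashed[-1] / squashed[-2]
```

Small values pass through almost unchanged because log(1 + |x|) ≈ |x| near zero, while large values grow only logarithmically. Signs and zero are preserved.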

The symmetric sum in TwoHot.pred() (embodied/jax/outs.py:L292-309) is a subtle numerical fix: when the number of bins N is odd, the implementation splits the bins into a left half, a center bin, and a right half, then sums `(p_left * b_left)[::-1] + (p_right * b_right)` so that mirror-image terms cancel before they are accumulated. Without this, the uniform distribution at initialization produces small non-zero value predictions that bias early training.

# From embodied/jax/outs.py:L292-309
n = self.logits.shape[-1]
if n % 2 == 1:
    m = (n - 1) // 2
    p1 = self.probs[..., :m]
    p2 = self.probs[..., m: m + 1]
    p3 = self.probs[..., m + 1:]
    b1 = self.bins[..., :m]
    b2 = self.bins[..., m: m + 1]
    b3 = self.bins[..., m + 1:]
    wavg = (p2 * b2).sum(-1) + ((p1 * b1)[..., ::-1] + (p3 * b3)).sum(-1)
    return self.unsquash(wavg)
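
The effect of the excerpt above can be reproduced outside the agent. The following is a sketch in NumPy with float32 arrays, under two assumptions not taken from the source: the bins are built by mirroring one half so that they are exactly symmetric, and the "naive" baseline is an explicit left-to-right loop (NumPy's own `sum` uses pairwise summation, so it would not show the same accumulation order). The function names are hypothetical.

```python
import numpy as np

n = 255
m = (n - 1) // 2

# Assumed construction: mirror the left half so bins are exactly symmetric
# around a zero center bin.
half = np.linspace(-20.0, 0.0, m + 1, dtype=np.float32)
bins = np.concatenate([half, -half[:-1][::-1]])
probs = np.full(n, 1.0 / n, dtype=np.float32)  # uniform, as at initialization

def naive_mean(probs, bins):
    """Strict left-to-right float32 accumulation; rounding error can build up."""
    total = np.float32(0.0)
    for p, b in zip(probs, bins):
        total += p * b
    return total

def symmetric_mean(probs, bins):
    """Pair each left term with its mirror-image right term before summing."""
    m = (len(bins) - 1) // 2
    left = (probs[:m] * bins[:m])[::-1]
    right = probs[m + 1:] * bins[m + 1:]
    center = probs[m] * bins[m]
    return center + (left + right).sum()

naive = naive_mean(probs, bins)
symmetric = symmetric_mean(probs, bins)
```

With uniform probabilities each mirrored pair is an exact negation, so `symmetric` is exactly zero, whereas `naive` is only guaranteed to be close to zero.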

Symlog/symexp from `embodied/jax/nets.py:L59-64`:

def symlog(x):
  return jnp.sign(x) * jnp.log1p(jnp.abs(x))

def symexp(x):
  return jnp.sign(x) * jnp.expm1(jnp.abs(x))

Default config from `dreamerv3/configs.yaml:L98-101`:

rewhead: {layers: 1, units: 1024, output: symexp_twohot, bins: 255}
value: {layers: 3, units: 1024, output: symexp_twohot, bins: 255}
