# Heuristic: Google DeepMind dm_control Tolerance Reward Tuning
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Reward_Shaping |
| Last Updated | 2026-02-15 05:00 GMT |
## Overview
Parameter tuning guide for the `tolerance()` reward function, the primary reward shaping primitive used across all Control Suite and Composer tasks.
## Description
The `tolerance()` function is dm_control's core reward primitive. It returns 1.0 when a value is within specified bounds and decays sigmoidally outside them. Nearly every Control Suite task composes rewards from one or more `tolerance()` calls. Understanding how to tune `bounds`, `margin`, `sigmoid`, and `value_at_margin` is essential for designing effective reward functions in custom tasks.
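To make the contract concrete, here is a minimal sketch of the default `'gaussian'` case, re-implemented in plain NumPy (the name `gaussian_tolerance` is ours, not dm_control's; the real `tolerance()` also dispatches to other sigmoids and validates its arguments):

```python
import numpy as np

def gaussian_tolerance(x, bounds=(0.0, 0.0), margin=0.0, value_at_margin=0.1):
    """Simplified sketch of tolerance() for the default 'gaussian' sigmoid:
    1.0 inside bounds, Gaussian decay outside, value_at_margin at distance == margin."""
    lower, upper = bounds
    in_bounds = np.logical_and(lower <= x, x <= upper)
    if margin == 0:
        return np.where(in_bounds, 1.0, 0.0)  # sparse/binary case
    # Distance from the nearest bound, in units of margin.
    d = np.where(x < lower, lower - x, x - upper) / margin
    # Scale chosen so the reward equals value_at_margin exactly at d == 1.
    scale = np.sqrt(-2.0 * np.log(value_at_margin))
    return np.where(in_bounds, 1.0, np.exp(-0.5 * (d * scale) ** 2))

gaussian_tolerance(0.5, bounds=(0.0, 1.0), margin=0.5)  # -> 1.0 (inside bounds)
gaussian_tolerance(1.5, bounds=(0.0, 1.0), margin=0.5)  # -> 0.1 (exactly one margin outside)
```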
## Usage
Apply this heuristic when designing reward functions for custom Composer tasks, debugging sparse vs dense reward issues, or adjusting reward sensitivity in existing tasks.
## The Insight (Rule of Thumb)
- Action: Choose `margin` and `sigmoid` based on desired reward landscape shape.
- Key Parameters:
  - `bounds=(lower, upper)`: Target interval. Returns 1.0 inside. Can be `(-inf, inf)` for unbounded sides or `(x, x)` for exact match.
  - `margin=0`: Binary reward (1 inside bounds, 0 outside). Use for sparse rewards.
  - `margin>0`: Smooth sigmoid decay. Set to the distance at which reward should equal `value_at_margin` (default 0.1). Use for dense/shaped rewards.
  - `value_at_margin=0.1`: The reward value at exactly `margin` distance from bounds. The default of 0.1 means reward drops to 10% at margin distance.
  - `sigmoid='gaussian'`: Default choice. Smooth, symmetric decay. Other options: `'hyperbolic'`, `'long_tail'`, `'reciprocal'`, `'cosine'`, `'linear'`, `'quadratic'`, `'tanh_squared'`.
- Trade-off: Smaller margins create steeper reward gradients (faster learning near target, harder exploration). Larger margins provide more gradient signal at distance (better exploration, slower convergence near target).
- Sigmoid choice: `'gaussian'` for standard tasks. `'long_tail'` or `'reciprocal'` when distant states still need nonzero reward. `'linear'` for simple proportional decay. `'cosine'` for compact support (exactly 0 beyond margin).
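The tail differences can be seen numerically. The sketch below re-derives three of the sigmoid shapes from their defining property (reward equals `value_at_margin` at one margin of distance); `d` is distance-from-bounds in units of margin, and `V` stands in for `value_at_margin`:

```python
import numpy as np

V = 0.1  # value_at_margin

def gaussian(d):
    # exp(-0.5 * (d*scale)**2), with scale solved so gaussian(1) == V.
    return np.exp(-0.5 * (d * np.sqrt(-2.0 * np.log(V))) ** 2)

def long_tail(d):
    # 1 / ((d*scale)**2 + 1): heavy tail, still ~1% reward at 3 margins.
    return 1.0 / ((d * np.sqrt(1.0 / V - 1.0)) ** 2 + 1.0)

def linear(d):
    # Proportional decay, clipped to exactly zero past 1/(1-V) margins.
    return np.maximum(1.0 - d * (1.0 - V), 0.0)

gaussian(1.0), long_tail(1.0), linear(1.0)  # -> all 0.1 at one margin
gaussian(3.0)   # ~1e-9: effectively no gradient signal this far out
long_tail(3.0)  # ~0.012: distant states still earn nonzero reward
linear(3.0)     # -> 0.0: compact-support-like cutoff
```

This is why `'long_tail'` (or `'reciprocal'`) helps exploration from far-off states, while `'gaussian'` concentrates the learning signal near the target.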
## Reasoning
The sigmoid scaling mechanism normalizes the distance-from-bounds by the margin, then applies the selected sigmoid function. The `value_at_1` parameter (internally derived from `value_at_margin`) sets the sigmoid's scale so that `sigmoid(1.0) = value_at_margin`. This means at distance == margin, reward == value_at_margin.
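For the default `'gaussian'` sigmoid this scaling can be checked directly: solving `exp(-0.5 * scale**2) == value_at_1` gives `scale = sqrt(-2 * ln(value_at_1))`, so the sigmoid passes through `value_at_1` at unit distance. A short numerical check (our own sketch, not dm_control's `_sigmoids` itself):

```python
import numpy as np

def gaussian_sigmoid(d, value_at_1):
    # scale solves exp(-0.5 * scale**2) == value_at_1.
    scale = np.sqrt(-2.0 * np.log(value_at_1))
    return np.exp(-0.5 * (d * scale) ** 2)

# At d == 1 (distance == margin), reward == value_at_margin, for any setting:
for v in (0.5, 0.1, 0.01):
    assert np.isclose(gaussian_sigmoid(1.0, v), v)
```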
For tasks requiring the agent to maintain a value in a range (e.g., upright stance, target velocity), `margin` should correspond to the acceptable deviation beyond which performance is poor. For example, in the `walker:walk` task, the tolerance on horizontal velocity uses a margin proportional to the target speed, so the reward ramps smoothly from zero velocity up to the target.
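As an illustration of this tuning pattern (the constant below is hypothetical, not the actual task's value), an "at least target speed" reward with linear decay looks like:

```python
import numpy as np

TARGET_SPEED = 1.0  # hypothetical target speed, not a dm_control constant

def speed_reward(v, margin=TARGET_SPEED, value_at_margin=0.1):
    """Illustrative shaped reward: 1.0 once v >= TARGET_SPEED, decaying
    linearly toward value_at_margin as v drops one margin below the bound."""
    if v >= TARGET_SPEED:
        return 1.0
    d = (TARGET_SPEED - v) / margin  # distance below the lower bound, in margins
    return float(np.maximum(1.0 - d * (1.0 - value_at_margin), 0.0))

speed_reward(1.2)  # -> 1.0  (at/above target: full reward)
speed_reward(0.5)  # -> 0.55 (halfway to target: partial credit)
speed_reward(0.0)  # -> 0.1  (stationary: exactly value_at_margin)
```

Because `margin` equals the target speed here, even a stationary agent receives a small gradient toward moving, which is the point of choosing a margin on the scale of the full deviation range.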
Code evidence from `utils/rewards.py:93-135`:
```python
def tolerance(x, bounds=(0.0, 0.0), margin=0.0, sigmoid='gaussian',
              value_at_margin=_DEFAULT_VALUE_AT_MARGIN):
  """Returns 1 when `x` falls inside the bounds, between 0 and 1 otherwise."""
  lower, upper = bounds
  if lower > upper:
    raise ValueError('Lower bound must be <= upper bound.')
  if margin < 0:
    raise ValueError('`margin` must be non-negative.')

  in_bounds = np.logical_and(lower <= x, x <= upper)
  if margin == 0:
    value = np.where(in_bounds, 1.0, 0.0)
  else:
    d = np.where(x < lower, lower - x, x - upper) / margin
    value = np.where(in_bounds, 1.0, _sigmoids(d, value_at_margin, sigmoid))

  return float(value) if np.isscalar(x) else value
```
Default value_at_margin from `utils/rewards.py:22`:
```python
_DEFAULT_VALUE_AT_MARGIN = 0.1
```