Principle: Google DeepMind dm_control Reward Shaping
| Metadata | |
|---|---|
| Knowledge Sources | dm_control |
| Domains | Reinforcement Learning, Reward Engineering, Control Theory |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Reward shaping is the principle of designing smooth, bounded reward signals that provide informative gradients to a learning agent. It works by mapping a continuous distance metric to a value between 0 and 1 using configurable sigmoid functions.
Description
In reinforcement learning, the reward function drives all learning. A naive binary reward (1 if the goal is reached, 0 otherwise) provides no gradient information when the agent is far from the goal, making exploration extremely difficult. Reward shaping addresses this by replacing the binary signal with a smooth function that:
- Returns 1.0 when the agent's state falls within a specified target interval (the bounds).
- Decays smoothly towards 0.0 as the agent moves away from the target, at a rate controlled by a margin parameter.
- Uses a configurable sigmoid shape to control the decay profile.
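The contrast with a binary reward can be made concrete. The sketch below is illustrative (the function names and parameter values are hypothetical, not from dm_control); it uses a Gaussian decay with margin 1.0 and a value at the margin of 0.1:

```python
import math

def binary_reward(dist, threshold=0.05):
    """Naive goal reward: 1 inside a small threshold, 0 everywhere else."""
    return 1.0 if dist <= threshold else 0.0

def shaped(dist, margin=1.0, value_at_margin=0.1):
    """Smooth Gaussian decay: 1 at the goal, value_at_margin at dist == margin."""
    scale = math.sqrt(-2.0 * math.log(value_at_margin))
    return math.exp(-0.5 * (dist / margin * scale) ** 2)

# Far from the goal the binary signal is flat (no gradient to follow),
# while the shaped signal still discriminates between nearby states:
print(binary_reward(0.8), binary_reward(0.9))   # 0.0 0.0
print(shaped(0.8) > shaped(0.9))                # True
```

Because the shaped reward strictly decreases with distance, any step that moves the agent closer to the goal is immediately rewarded, even when the goal itself is far away.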
The key design parameters are:
- Bounds -- a pair (lower, upper) defining the interval within which the reward is maximal. When lower == upper, the target is an exact value.
- Margin -- the distance from the bounds at which the reward drops to a specified reference value. A margin of 0 produces a hard threshold; a positive margin produces a smooth transition.
- Sigmoid type -- the mathematical function used for the decay. Different sigmoids offer different trade-offs between tail behaviour and gradient strength.
- Value at margin -- the reward value when the distance from the bounds exactly equals the margin, anchoring the sigmoid's scale.
Usage
Reward shaping is used in every manipulation task to convert a distance measure (e.g. Euclidean distance from the hand to a target) into a dense reward signal. It is also applicable outside manipulation, in any domain where a continuous metric exists between the current state and a goal state.
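As a minimal sketch of this usage (names and parameter values are hypothetical, and the target here is an exact point, i.e. the bounds collapse to zero distance), a Euclidean hand-to-target distance can be mapped to a dense reward with a Gaussian sigmoid:

```python
import math

def shaped_reward(hand_pos, target_pos, margin=0.2, value_at_margin=0.1):
    """Map the Euclidean distance from hand to target to a reward in (0, 1]."""
    d = math.dist(hand_pos, target_pos)  # continuous distance metric
    # scale is chosen so the reward equals value_at_margin at d == margin.
    scale = math.sqrt(-2.0 * math.log(value_at_margin))
    return math.exp(-0.5 * (d / margin * scale) ** 2)

print(shaped_reward((0.0, 0.0, 0.0), (0.0, 0.0, 0.0)))  # 1.0 (at the target)
print(shaped_reward((0.2, 0.0, 0.0), (0.0, 0.0, 0.0)))  # ~0.1 (one margin away)
```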
Theoretical Basis
The tolerance function is defined piecewise:
tolerance(x, bounds=(lower, upper), margin, sigmoid, value_at_margin):
    if lower <= x <= upper:
        return 1.0
    if margin == 0:
        return 0.0  # hard threshold outside the bounds
    d = distance_to_nearest_bound(x) / margin
    return sigmoid_function(d, value_at_margin)
The sigmoid function maps the normalised distance d (where d = 1 at the margin) to a value in [0, 1]. The available sigmoids and their formulas are:
| Sigmoid | Formula | Tail Behaviour |
|---|---|---|
| Gaussian | exp(-0.5 * (d * scale)^2) | Fast decay (exponential) |
| Hyperbolic | 1 / cosh(d * scale) | Moderate decay |
| Long tail | 1 / ((d * scale)^2 + 1) | Slow decay (polynomial) |
| Reciprocal | 1 / (abs(d) * scale + 1) | Slow decay (linear denominator) |
| Cosine | (1 + cos(pi * d * scale)) / 2 | Compact support (reaches 0) |
| Linear | 1 - d * scale | Compact support (reaches 0) |
| Quadratic | 1 - (d * scale)^2 | Compact support (reaches 0) |
| Tanh squared | 1 - tanh(d * scale)^2 | Moderate decay |
In each case, scale is derived from value_at_margin so that sigmoid_function(1, value_at_margin) = value_at_margin.
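Solving that anchoring condition for scale is straightforward algebra. The sketch below works it out for three of the sigmoids (the remaining ones follow the same pattern) and checks that each derived scale really does yield value_at_margin at d = 1:

```python
import math

v = 0.1  # value_at_margin

# Gaussian: exp(-0.5 * scale^2) = v  =>  scale = sqrt(-2 * ln(v))
scale_gaussian = math.sqrt(-2.0 * math.log(v))
# Long tail: 1 / (scale^2 + 1) = v  =>  scale = sqrt(1/v - 1)
scale_long_tail = math.sqrt(1.0 / v - 1.0)
# Linear: 1 - scale = v  =>  scale = 1 - v
scale_linear = 1.0 - v

# Each sigmoid evaluated at d = 1 recovers value_at_margin:
assert abs(math.exp(-0.5 * scale_gaussian**2) - v) < 1e-9
assert abs(1.0 / (scale_long_tail**2 + 1.0) - v) < 1e-9
assert abs((1.0 - scale_linear) - v) < 1e-9
```

This anchoring is what makes value_at_margin the single knob controlling how sharply the reward falls off, independently of which sigmoid shape is chosen.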