

Principle: Google DeepMind dm_control Reward Shaping

From Leeroopedia
Metadata
Knowledge Sources dm_control
Domains Reinforcement Learning, Reward Engineering, Control Theory
Last Updated 2026-02-15 00:00 GMT

Overview

Reward shaping is the principle of designing smooth, bounded reward signals that provide informative gradients to a learning agent by mapping a continuous distance metric to a value between 0 and 1 using configurable sigmoid functions.

Description

In reinforcement learning, the reward function drives all learning. A naive binary reward (1 if the goal is reached, 0 otherwise) provides no gradient information when the agent is far from the goal, making exploration extremely difficult. Reward shaping addresses this by replacing the binary signal with a smooth function that:

  • Returns 1.0 when the agent's state falls within a specified target interval (the bounds).
  • Decays smoothly towards 0.0 as the agent moves away from the target, at a rate controlled by a margin parameter.
  • Uses a configurable sigmoid shape to control the decay profile.

The key design parameters are:

  • Bounds -- a pair (lower, upper) defining the interval within which the reward is maximal. When lower == upper, the target is an exact value.
  • Margin -- the distance from the bounds at which the reward drops to a specified reference value. A margin of 0 produces a hard threshold; a positive margin produces a smooth transition.
  • Sigmoid type -- the mathematical function used for the decay. Different sigmoids offer different trade-offs between tail behaviour and gradient strength.
  • Value at margin -- the reward value when the distance from the bounds exactly equals the margin, anchoring the sigmoid's scale.

Usage

Reward shaping is used in every manipulation task to convert a distance measure (e.g. Euclidean distance from the hand to a target) into a dense reward signal. It is also applicable outside manipulation, in any domain where a continuous metric exists between the current state and a goal state.
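As a sketch of this pattern, the snippet below maps the Euclidean hand-to-target distance to a dense reward. It is an illustrative example, not dm_control's actual implementation; the choice of a Gaussian sigmoid, the 0.5 margin, and the 0.1 value-at-margin are assumptions.

```python
import math

def reach_reward(hand_pos, target_pos, margin=0.5, value_at_margin=0.1):
    """Map Euclidean hand-to-target distance to a dense reward in (0, 1].

    Uses a Gaussian sigmoid: the reward is 1.0 at zero distance and decays
    to `value_at_margin` when the distance exactly equals `margin`.
    """
    d = math.dist(hand_pos, target_pos) / margin          # normalised distance
    scale = math.sqrt(-2.0 * math.log(value_at_margin))   # anchors reward(margin) at value_at_margin
    return math.exp(-0.5 * (d * scale) ** 2)

print(reach_reward((0.0, 0.0, 0.0), (0.0, 0.0, 0.0)))  # at the target: 1.0
print(reach_reward((0.5, 0.0, 0.0), (0.0, 0.0, 0.0)))  # at the margin: ~0.1
```

Because the reward is strictly positive everywhere, the agent receives an informative gradient even when it starts far from the target.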

Theoretical Basis

The tolerance function is defined piecewise:

tolerance(x, bounds=(lower, upper), margin, sigmoid, value_at_margin):

    if lower <= x <= upper:
        return 1.0

    if margin == 0:
        return 0.0    # hard threshold: zero reward outside the bounds

    d = distance_to_nearest_bound(x) / margin    # d = 1 exactly at the margin

    return sigmoid_function(d, value_at_margin)
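A runnable version of the pseudocode above might look as follows. This is a minimal sketch supporting only the gaussian and linear sigmoids, not the full dm_control implementation:

```python
import math

def tolerance(x, bounds=(0.0, 0.0), margin=0.0, sigmoid='gaussian',
              value_at_margin=0.1):
    """Return 1.0 when x lies inside `bounds`, decaying towards 0.0 outside."""
    lower, upper = bounds
    if lower <= x <= upper:
        return 1.0
    if margin == 0:
        return 0.0  # hard threshold: no shaping outside the bounds
    # Normalised distance to the nearest bound (d == 1 exactly at the margin).
    d = (lower - x if x < lower else x - upper) / margin
    if sigmoid == 'gaussian':
        scale = math.sqrt(-2.0 * math.log(value_at_margin))
        return math.exp(-0.5 * (d * scale) ** 2)
    elif sigmoid == 'linear':
        scale = 1.0 - value_at_margin
        return max(0.0, 1.0 - d * scale)  # compact support: clamp at 0
    raise ValueError(f'Unsupported sigmoid: {sigmoid}')

print(tolerance(0.0, bounds=(0.0, 0.0), margin=1.0))  # inside bounds: 1.0
print(tolerance(1.0, bounds=(0.0, 0.0), margin=1.0))  # at the margin: ~0.1
```

Note that both sigmoids return approximately `value_at_margin` at `d = 1`, so swapping sigmoids changes only the shape of the decay, not its anchor point.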

The sigmoid function maps the normalised distance d (where d = 1 at the margin) to a value in [0, 1]. The available sigmoids and their formulas are:

Sigmoid      | Formula                        | Tail Behaviour
Gaussian     | exp(-0.5 * (d * scale)^2)      | Fast decay (exponential)
Hyperbolic   | 1 / cosh(d * scale)            | Moderate decay
Long tail    | 1 / ((d * scale)^2 + 1)        | Slow decay (polynomial)
Reciprocal   | 1 / (|d| * scale + 1)          | Slow decay (linear denominator)
Cosine       | (1 + cos(pi * d * scale)) / 2  | Compact support (reaches 0)
Linear       | 1 - d * scale                  | Compact support (reaches 0)
Quadratic    | 1 - (d * scale)^2              | Compact support (reaches 0)
Tanh squared | 1 - tanh(d * scale)^2          | Moderate decay

In each case, scale is derived from value_at_margin so that sigmoid_function(1, value_at_margin) = value_at_margin. The compact-support sigmoids (cosine, linear, quadratic) are clamped at 0 once the formula would otherwise go negative.
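The scale for each sigmoid follows from solving its formula at d = 1. For the Gaussian, exp(-0.5 * scale^2) = value_at_margin gives scale = sqrt(-2 * ln(value_at_margin)); for the long tail, 1 / (scale^2 + 1) = value_at_margin gives scale = sqrt(1/value_at_margin - 1). A quick numeric check of these two closed forms:

```python
import math

value_at_margin = 0.1

# Gaussian: solve exp(-0.5 * scale^2) == value_at_margin for scale.
gauss_scale = math.sqrt(-2.0 * math.log(value_at_margin))
assert abs(math.exp(-0.5 * gauss_scale ** 2) - value_at_margin) < 1e-9

# Long tail: solve 1 / (scale^2 + 1) == value_at_margin for scale.
lt_scale = math.sqrt(1.0 / value_at_margin - 1.0)
assert abs(1.0 / (lt_scale ** 2 + 1.0) - value_at_margin) < 1e-9
```

With value_at_margin = 0.1, the long-tail scale works out to exactly 3, since 1/(3^2 + 1) = 0.1.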
