Principle:Farama Foundation Gymnasium Reward Transformation

From Leeroopedia
Knowledge Sources
Domains Reinforcement_Learning, Reward_Engineering
Last Updated 2026-02-15 03:00 GMT

Overview

Stateless reward transformation wrappers apply deterministic functions such as clipping and scaling to environment rewards without maintaining internal state.

Description

Reward transformation wrappers modify the reward signal produced by an environment before it reaches the learning algorithm. Unlike stateful normalization (which tracks running statistics), these transformations are pure functions of the current reward value. The most common transformations are clipping (bounding the reward within a specified range) and arbitrary functional transformations (applying a user-defined function). These wrappers are stateless, meaning their behavior depends only on the current input and fixed parameters, not on the history of past rewards.
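The stateless/stateful distinction can be illustrated with a minimal sketch in plain Python, independent of Gymnasium. The names `clip_reward` and `RunningMeanCenter` are hypothetical and chosen only for this example; the running-mean centering stands in for any history-dependent normalization and is not the library's normalization scheme.

```python
def clip_reward(r, low=-1.0, high=1.0):
    """Stateless: the output depends only on r and the fixed bounds."""
    return min(max(r, low), high)


class RunningMeanCenter:
    """Stateful (illustrative only): the output depends on all rewards seen so far."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def __call__(self, r):
        # Incrementally update the running mean, then subtract it from the reward.
        self.count += 1
        self.mean += (r - self.mean) / self.count
        return r - self.mean


center = RunningMeanCenter()

# The same input always yields the same output for the stateless function.
stateless_outputs = [clip_reward(5.0), clip_reward(5.0)]

# The same input (4.0) yields different outputs depending on what came before.
stateful_outputs = [center(4.0), center(10.0), center(4.0)]
```

Here the reward 4.0 is mapped to two different values by the stateful transform, while the stateless clip returns the same value on every call.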

Reward clipping bounds the reward to a specified interval, which prevents extreme reward values from destabilizing gradient-based learning. This was famously used in the original DQN paper, where Atari game rewards were clipped to [-1, 1] to normalize the reward scale across different games. Reward scaling (applying a multiplicative or additive transformation) adjusts the reward magnitude to match the learning algorithm's expected range. The general TransformReward wrapper accepts any callable function, enabling arbitrary reward shaping.

Both single-environment and vectorized versions of reward transformation wrappers are provided. The vectorized versions apply the transformation across all parallel environments simultaneously, which is essential for maintaining consistent reward processing in multi-environment training setups. The separation of stateless reward transformation from stateful reward normalization follows the single-responsibility principle, with each wrapper having a clear and predictable behavior.
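As a rough sketch of what the vectorized case amounts to, the same function can be applied element-wise to an array holding one reward per parallel environment. The helper `transform_batch` is a hypothetical name used for illustration and is not part of the Gymnasium API; only NumPy is assumed.

```python
import numpy as np


def transform_batch(rewards, func):
    """Apply func to a batch of rewards, one entry per parallel environment."""
    return func(np.asarray(rewards, dtype=np.float64))


# One reward from each of four hypothetical parallel environments.
batch = [-7.0, 0.5, 120.0, 0.0]

clipped = transform_batch(batch, lambda r: np.clip(r, -1.0, 1.0))
scaled = transform_batch(batch, lambda r: 0.01 * r)
```

Because the function is stateless, every environment's reward is processed identically and independently of the others.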

Usage

Use reward clipping when training across environments with different reward scales (for example, Atari games) to prevent any single environment's reward magnitude from dominating learning. Use reward scaling to adjust reward magnitudes to a range that works well with the learning algorithm's hyperparameters. Use the general transform wrapper for custom reward shaping functions such as log transformations, sign functions, or piecewise mappings. Use the vector versions when training with multiple parallel environments.
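The custom shaping functions mentioned above might look like the following plain-Python callables. The function names and the threshold value are assumptions made for illustration. Any of them could be passed as the callable to the general transform wrapper.

```python
import math


def sign_reward(r):
    """Keep only the sign of the reward: -1, 0, or +1."""
    return (r > 0) - (r < 0)


def signed_log_reward(r):
    """Compress large magnitudes while keeping the sign: sign(r) * log(1 + |r|)."""
    return math.copysign(math.log1p(abs(r)), r)


def threshold_reward(r, threshold=10.0):
    """Piecewise mapping: 1.0 once the raw reward reaches the threshold, else 0.0."""
    return 1.0 if r >= threshold else 0.0


raw_rewards = [-250.0, 0.0, 3.0, 40.0]
signs = [sign_reward(r) for r in raw_rewards]
logs = [signed_log_reward(r) for r in raw_rewards]
flags = [threshold_reward(r) for r in raw_rewards]
```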

Theoretical Basis

Reward transformation modifies the MDP reward function:

$$\tilde{R}(s, a, s') = f\big(R(s, a, s')\big)$$

where f is the transformation function. The agent optimizes the transformed objective:

$$\tilde{J}(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} f(r_t)\right]$$

Clipping:

$$f_{\text{clip}}(r) = \operatorname{clip}(r, r_{\min}, r_{\max}) = \min\big(\max(r, r_{\min}), r_{\max}\big)$$

Note that clipping can change the optimal policy whenever rewards fall outside $[r_{\min}, r_{\max}]$, because it removes the distinction between different reward magnitudes beyond the clip boundary.
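A small worked example makes the effect on the objective concrete. The sketch below evaluates the discounted sum of transformed rewards for two hypothetical trajectories, first with the identity and then with clipping to [-1, 1]. The helper names, reward values, and discount factor are all assumptions chosen for illustration.

```python
def clip(r, r_min=-1.0, r_max=1.0):
    """Bound r to the interval [r_min, r_max]."""
    return min(max(r, r_min), r_max)


def discounted_return(rewards, gamma, f=lambda r: r):
    """Sum of gamma**t * f(r_t) over one finite trajectory."""
    return sum((gamma ** t) * f(r) for t, r in enumerate(rewards))


gamma = 0.5
big_payoff = [10.0]               # one large reward, then the episode ends
steady_payoff = [2.0, 2.0, 2.0]   # several moderate rewards

raw = (
    discounted_return(big_payoff, gamma),
    discounted_return(steady_payoff, gamma),
)
clipped = (
    discounted_return(big_payoff, gamma, clip),
    discounted_return(steady_payoff, gamma, clip),
)
```

With raw rewards the single large payoff has the higher return (10.0 versus 3.5). After clipping, every reward above 1 becomes 1, so the steady trajectory wins (1.75 versus 1.0) and the preferred behavior flips.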

Linear transformation:

$$f_{\text{linear}}(r) = a r + b$$

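Linear scaling with $b = 0$ and $a > 0$ preserves the optimal policy (it only changes the scale of value estimates); a negative $a$ reverses the preference ordering. Adding a constant $b \neq 0$ shifts every infinite-horizon discounted return by the same amount, $b / (1 - \gamma)$, and so leaves the policy unchanged in that setting. In episodic tasks with variable episode length, however, it rewards ($b > 0$) or penalizes ($b < 0$) longer episodes and can therefore change the optimal policy.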
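The role of the slope and the offset can be checked numerically. The sketch below compares two hypothetical episodes of different length under several choices of $a$ and $b$, using an undiscounted return ($\gamma = 1$) for simplicity. The function names and reward values are assumptions for illustration only.

```python
def linear_return(rewards, a=1.0, b=0.0, gamma=1.0):
    """Discounted return of one episode after mapping each reward r to a*r + b."""
    return sum((gamma ** t) * (a * r + b) for t, r in enumerate(rewards))


short_episode = [1.0]                  # ends after one step
long_episode = [0.0, 0.0, 0.0, 0.5]    # lasts four steps


def prefers_short(a, b):
    """True if the short episode has the higher transformed return."""
    return linear_return(short_episode, a, b) > linear_return(long_episode, a, b)


unchanged = prefers_short(a=1.0, b=0.0)   # raw returns: 1.0 vs 0.5
scaled = prefers_short(a=3.0, b=0.0)      # positive scaling: 3.0 vs 1.5
negated = prefers_short(a=-1.0, b=0.0)    # negative scaling: -1.0 vs -0.5
shifted = prefers_short(a=1.0, b=1.0)     # per-step bonus: 2.0 vs 4.5
```

Positive scaling keeps the ordering, whereas a negative slope or a per-step bonus in this variable-length setting reverses it.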

General transformation:

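# Simplified sketch of the wrapper logic, not the library source.
import gymnasium as gym
import numpy as np

class TransformReward(gym.RewardWrapper):
    """Apply a user-supplied function to every reward."""

    def __init__(self, env, func):
        super().__init__(env)  # register the wrapped environment
        self.func = func

    def reward(self, reward):
        return self.func(reward)

class ClipReward(TransformReward):
    """Bound every reward to [min_reward, max_reward]."""

    def __init__(self, env, min_reward=-1.0, max_reward=1.0):
        # np.clip computes min(max(r, min_reward), max_reward)
        super().__init__(env, lambda r: np.clip(r, min_reward, max_reward))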

Potential-based reward shaping (for reference) preserves optimal policies:

$$f_{\text{shaped}}(s, a, s') = R(s, a, s') + \gamma \Phi(s') - \Phi(s)$$

where Φ is a potential function. However, the wrappers here apply simpler transformations that depend only on the reward value, not on the state.
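For comparison, the reason potential-based shaping is policy-preserving can be seen numerically: along any trajectory the shaping terms telescope, so the shaped return differs from the original only by a quantity that depends on the start and end states. The sketch below uses a hypothetical four-state trajectory and the made-up potential $\Phi(s) = s$; none of these values come from the library.

```python
def shaped_rewards(rewards, states, potential, gamma):
    """Add gamma * potential(s') - potential(s) to each transition reward.

    states holds s_0, ..., s_T and therefore has one more entry than rewards.
    """
    return [
        r + gamma * potential(s_next) - potential(s)
        for r, s, s_next in zip(rewards, states[:-1], states[1:])
    ]


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one finite trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def potential(s):
    """Hypothetical potential function: the state index itself."""
    return float(s)


gamma = 0.5
states = [0, 1, 2, 3]
rewards = [0.0, 0.0, 1.0]

shaped = shaped_rewards(rewards, states, potential, gamma)
difference = discounted_return(shaped, gamma) - discounted_return(rewards, gamma)

# Telescoped form: gamma**T * potential(s_T) - potential(s_0), with T = 3 steps.
telescoped = gamma ** len(rewards) * potential(states[-1]) - potential(states[0])
```

The two quantities agree, and neither depends on the intermediate states visited. A reward-only transformation such as clipping has no access to the states and so cannot be written in this form.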
