Principle:Farama Foundation Gymnasium Reward Transformation
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Reward_Engineering |
| Last Updated | 2026-02-15 03:00 GMT |
Overview
Stateless reward transformation wrappers apply deterministic functions such as clipping and scaling to environment rewards without maintaining internal state.
Description
Reward transformation wrappers modify the reward signal produced by an environment before it reaches the learning algorithm. Unlike stateful normalization (which tracks running statistics), these transformations are pure functions of the current reward value. The most common transformations are clipping (bounding the reward within a specified range) and arbitrary functional transformations (applying a user-defined function). These wrappers are stateless, meaning their behavior depends only on the current input and fixed parameters, not on the history of past rewards.
Reward clipping bounds the reward to a specified interval, which prevents extreme reward values from destabilizing gradient-based learning. This was famously used in the original DQN paper, where Atari game rewards were clipped to [-1, 1] to normalize the reward scale across different games. Reward scaling (applying a multiplicative or additive transformation) adjusts the reward magnitude to match the learning algorithm's expected range. The general TransformReward wrapper accepts any callable function, enabling arbitrary reward shaping.
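As a minimal sketch of the two transformations described above, here they are written as plain Python functions rather than as the library's wrappers. The names `clip_reward` and `scale_reward` and the sample rewards are illustrative assumptions, not Gymnasium API.

```python
def clip_reward(reward, min_reward=-1.0, max_reward=1.0):
    """Bound a scalar reward to [min_reward, max_reward]."""
    return max(min_reward, min(max_reward, reward))


def scale_reward(reward, scale=1.0, shift=0.0):
    """Apply the affine map scale * reward + shift."""
    return scale * reward + shift


# Raw per-step rewards with very different magnitudes (made-up values).
raw_rewards = [400.0, -35.0, 0.0, 0.5]

clipped = [clip_reward(r) for r in raw_rewards]
scaled = [scale_reward(r, scale=0.01) for r in raw_rewards]

print(clipped)  # [1.0, -1.0, 0.0, 0.5]
print(scaled)
```

Both functions depend only on their arguments, so calling them twice with the same reward always gives the same result.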
Both single-environment and vectorized versions of reward transformation wrappers are provided. The vectorized versions apply the transformation across all parallel environments simultaneously, which is essential for maintaining consistent reward processing in multi-environment training setups. Keeping stateless reward transformation separate from stateful reward normalization follows the single-responsibility principle: each wrapper has one clear, predictable behavior.
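The effect of a vectorized wrapper can be sketched with NumPy, assuming the parallel environments return their rewards as one array per step. The helper `clip_reward_batch` is hypothetical and only mirrors the idea; it is not the library's vector wrapper.

```python
import numpy as np


def clip_reward_batch(rewards, min_reward=-1.0, max_reward=1.0):
    """Clip a batch of rewards, one entry per parallel environment."""
    return np.clip(rewards, min_reward, max_reward)


# One step of rewards from four parallel environments (made-up values).
step_rewards = np.array([3.0, -0.2, -7.5, 0.0])

batch_clipped = clip_reward_batch(step_rewards)

# Clipping each environment's reward separately gives the same result,
# so every environment is processed identically.
one_by_one = np.array([clip_reward_batch(r) for r in step_rewards])

print(batch_clipped)
```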
Usage
Use reward clipping when training across environments with different reward scales (for example, Atari games) to prevent any single environment's reward magnitude from dominating learning. Use reward scaling to adjust reward magnitudes to a range that works well with the learning algorithm's hyperparameters. Use the general transform wrapper for custom reward shaping functions such as log transformations, sign functions, or piecewise mappings. Use the vector versions when training with multiple parallel environments.
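The custom shaping functions mentioned above could look like the following. These are illustrative callables of the kind that could be passed to a general transform wrapper; the function names and the thresholds in `piecewise_reward` are assumptions made for this example.

```python
import math


def signed_log_reward(reward):
    """Compress large magnitudes while keeping the sign: sign(r) * log(1 + |r|)."""
    return math.copysign(math.log1p(abs(reward)), reward)


def sign_reward(reward):
    """Map any reward to -1.0, 0.0, or 1.0."""
    return float((reward > 0) - (reward < 0))


def piecewise_reward(reward):
    """Hypothetical piecewise mapping: small penalties ignored, large gains capped."""
    if reward < -1.0:
        return -1.0
    if reward < 0.0:
        return 0.0
    return min(reward, 5.0)
```

Each function takes the current reward and nothing else, which is what keeps the resulting wrapper stateless.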
Theoretical Basis
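Reward transformation modifies the MDP reward function:
$$r'_t = f(r_t)$$
where $f$ is the transformation function. The agent optimizes the transformed objective:
$$J'(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t f(r_t)\right]$$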
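To make the transformed objective concrete, the sketch below computes the discounted return of one made-up reward sequence before and after applying a transformation. The helper `discounted_return`, the discount factor, and the rewards are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma, f=lambda r: r):
    """Sum of gamma**t * f(r_t) over one trajectory."""
    return sum((gamma ** t) * f(r) for t, r in enumerate(rewards))


rewards = [1.0, 0.0, 4.0]  # made-up trajectory rewards
gamma = 0.5

original = discounted_return(rewards, gamma)
# Transformation: cap every reward at 1.0.
transformed = discounted_return(rewards, gamma, f=lambda r: min(r, 1.0))

print(original)     # 2.0
print(transformed)  # 1.25
```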
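Clipping:
$$f(r) = \operatorname{clip}(r, r_{\min}, r_{\max}) = \max(r_{\min}, \min(r, r_{\max}))$$
Note that clipping can change the optimal policy when rewards fall outside $[r_{\min}, r_{\max}]$, as it removes the distinction between different reward magnitudes beyond the clip boundary.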
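A small worked example shows how the preferred behavior can flip under clipping. The two policies and their reward sequences are made up, and returns are undiscounted for simplicity.

```python
def trajectory_return(rewards, clip=False, low=-1.0, high=1.0):
    """Undiscounted return, optionally clipping each reward first."""
    if clip:
        rewards = [max(low, min(high, r)) for r in rewards]
    return sum(rewards)


# Two hypothetical policies in a made-up episodic task.
policy_rewards = {
    "one_big_reward": [10.0],
    "three_small_rewards": [1.0, 1.0, 1.0],
}

best_raw = max(policy_rewards, key=lambda p: trajectory_return(policy_rewards[p]))
best_clipped = max(
    policy_rewards, key=lambda p: trajectory_return(policy_rewards[p], clip=True)
)

print(best_raw)      # one_big_reward
print(best_clipped)  # three_small_rewards
```

Without clipping the single reward of 10 wins (10 versus 3); after clipping to [-1, 1] it counts as 1, so the three small rewards win (3 versus 1).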
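Linear transformation:
$$f(r) = a r + b$$
Linear scaling with $a > 0$ and $b = 0$ preserves the optimal policy (it only changes the scale of value estimates). Adding a constant $b \neq 0$ is not neutral in general: it contributes $b$ at every step, so in episodic tasks it rewards ($b > 0$) or penalizes ($b < 0$) longer episodes and can change the optimal policy.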
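A small worked example, using two made-up episodes and undiscounted returns, shows that positive scaling keeps the ranking of the two episodes while a constant offset can reverse it when episode lengths differ. The helper names are hypothetical.

```python
def episode_return(rewards, a=1.0, b=0.0):
    """Undiscounted return after the linear map r -> a * r + b."""
    return sum(a * r + b for r in rewards)


# Two hypothetical episodes of different lengths (made-up rewards).
short_episode = [5.0]
long_episode = [1.0, 1.0, 1.0]


def preferred(a=1.0, b=0.0):
    """Name of the episode with the higher transformed return."""
    short = episode_return(short_episode, a, b)
    long = episode_return(long_episode, a, b)
    return "short" if short > long else "long"


print(preferred())       # short
print(preferred(a=0.1))  # short
print(preferred(b=2.0))  # long
```

With an offset of 2 per step, the short episode returns 7 and the long one 9, so the longer episode becomes preferable even though its raw return is lower.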
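General transformation: $r' = f(r)$ for any user-supplied callable $f$. In simplified form (a sketch of the structure, not the library source), the wrappers look like this:

    import numpy as np
    from gymnasium import RewardWrapper

    class TransformReward(RewardWrapper):
        def __init__(self, env, func):
            super().__init__(env)  # register the wrapped environment
            self.func = func

        def reward(self, reward):
            # Applied to each step's reward; no state is kept between calls.
            return self.func(reward)

    class ClipReward(TransformReward):
        def __init__(self, env, min_reward=-1.0, max_reward=1.0):
            func = lambda r: np.clip(r, min_reward, max_reward)
            super().__init__(env, func)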
Potential-based reward shaping (for reference) preserves optimal policies:
$$r' = r + \gamma \Phi(s') - \Phi(s)$$
where $\Phi$ is a potential function over states. However, the wrappers here apply simpler transformations that depend only on the reward value, not on the state.
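The reason potential-based shaping leaves the optimal policy unchanged can be seen by summing the shaping term along a trajectory, where it telescopes. A short derivation, assuming the standard shaping term $\gamma \Phi(s_{t+1}) - \Phi(s_t)$, a discount factor $\gamma < 1$, and a bounded $\Phi$:

```latex
\begin{aligned}
\sum_{t=0}^{T-1} \gamma^t \bigl[\gamma \Phi(s_{t+1}) - \Phi(s_t)\bigr]
  &= \sum_{t=0}^{T-1} \gamma^{t+1} \Phi(s_{t+1}) - \sum_{t=0}^{T-1} \gamma^t \Phi(s_t) \\
  &= \gamma^{T} \Phi(s_T) - \Phi(s_0)
  \;\xrightarrow{\;T \to \infty\;}\; -\Phi(s_0)
\end{aligned}
```

The shaped return therefore differs from the original return only by $-\Phi(s_0)$, which depends on the start state and not on the actions taken, so the ranking of policies is unchanged.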