Principle:Farama Foundation Gymnasium Observation Normalization

From Leeroopedia
Knowledge Sources
Domains Reinforcement_Learning, Training_Stability
Last Updated 2026-02-15 03:00 GMT

Overview

Normalizing observations and rewards with running statistics stabilizes reinforcement learning training by keeping each input dimension at approximately unit variance.

Description

Observation and reward normalization addresses a fundamental challenge in reinforcement learning: the scale mismatch between different observation dimensions and the non-stationarity of reward distributions during training. When observation features span vastly different ranges (for example, position in meters vs. angular velocity in radians per second), neural network function approximators can struggle to learn effectively. Similarly, raw reward magnitudes can vary by orders of magnitude across different environments or training phases, causing instability in value function estimation.

The normalization approach uses Welford's online algorithm to compute running mean and variance statistics. As the agent interacts with the environment, each observation is normalized by subtracting the running mean and dividing by the running standard deviation. For rewards, a discounted return estimator tracks the running variance of cumulative rewards, and the instantaneous reward is divided by the running standard deviation of those returns; rewards are rescaled but not mean-centered. Both wrappers support freezing the running statistics (disabling updates) during evaluation to prevent test-time distribution shift.

This normalization scheme operates as a transparent wrapper layer that sits between the environment and the learning algorithm. It maintains internal state (the running statistics) that evolves over the course of training. Both single-environment and vectorized (multi-environment) versions are provided, with the vector versions computing statistics across all parallel environments for more stable estimates. The wrapper utilities module provides the core RunningMeanStd class and the parallel algorithm for updating moments from batch statistics.

Usage

Use observation normalization when training RL agents on environments with heterogeneous observation scales or when neural network training is unstable. Use reward normalization when reward magnitudes vary significantly during training or across environments. Freeze the running statistics during evaluation by setting the update flag to False. Use the vector versions when training with multiple parallel environments for more stable running statistics. The RunningMeanStd utility class can also be used independently in custom normalization schemes.

Theoretical Basis

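The normalization uses Welford's online algorithm, with its parallel (batch) variant for vector environments, to compute running statistics. Given a stream of observations $x_1, x_2, \ldots, x_t$, with $n_t$ the number of observations seen so far, the running mean and variance are maintained as: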

$$\mu_t = \mu_{t-1} + \frac{x_t - \mu_{t-1}}{n_t}$$

$$\sigma_t^2 = \frac{(n_t - 1)\,\sigma_{t-1}^2 + (x_t - \mu_{t-1})(x_t - \mu_t)}{n_t}$$
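
The two recurrences can be checked against a two-pass computation. The following is a minimal sketch for a scalar stream, not the library's implementation; the function name `welford_update` and the zero initial values are assumptions made for illustration.

```python
import statistics


def welford_update(mean, var, count, x):
    """Apply one step of the running mean/variance recurrence (population variance)."""
    new_count = count + 1
    new_mean = mean + (x - mean) / new_count
    # (new_count - 1) * var is the sum of squared deviations before this sample.
    new_var = ((new_count - 1) * var + (x - mean) * (x - new_mean)) / new_count
    return new_mean, new_var, new_count


samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, var, count = 0.0, 0.0, 0  # assumed starting state for this sketch
for x in samples:
    mean, var, count = welford_update(mean, var, count, x)

# The streaming result should agree with a two-pass computation over the full list.
assert abs(mean - statistics.fmean(samples)) < 1e-12
assert abs(var - statistics.pvariance(samples)) < 1e-12
```

Only three numbers per dimension are stored, so memory use does not grow with the length of the stream.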

The normalized observation is then:

$$\hat{x}_t = \frac{x_t - \mu_t}{\sqrt{\sigma_t^2 + \epsilon}}$$

where $\epsilon$ is a small constant (typically $10^{-8}$) for numerical stability.
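
As an illustration of per-dimension normalization with frozen statistics at evaluation time, here is a pure-Python sketch. The class `RunningNormalizer`, its `update_stats` attribute, and the initial values (mean 0, variance 1) are hypothetical names and choices for this example, not the wrapper's actual API.

```python
import math


class RunningNormalizer:
    """Hypothetical per-dimension normalizer; a sketch, not the library class."""

    def __init__(self, dim, epsilon=1e-8):
        self.mean = [0.0] * dim
        self.var = [1.0] * dim   # assumed initial variance
        self.count = 0
        self.epsilon = epsilon
        self.update_stats = True  # set to False to freeze the statistics

    def __call__(self, obs):
        if self.update_stats:
            self.count += 1
            for i, x in enumerate(obs):
                old_mean = self.mean[i]
                self.mean[i] = old_mean + (x - old_mean) / self.count
                self.var[i] = ((self.count - 1) * self.var[i]
                               + (x - old_mean) * (x - self.mean[i])) / self.count
        # Subtract the running mean and divide by the running standard deviation.
        return [(x - m) / math.sqrt(v + self.epsilon)
                for x, m, v in zip(obs, self.mean, self.var)]


norm = RunningNormalizer(dim=2)
# Two features on very different scales (illustrative values only).
stream = [[100.0, 0.01], [120.0, 0.03], [80.0, 0.02], [110.0, 0.00]]
for obs in stream:
    norm(obs)

norm.update_stats = False        # evaluation: statistics stay fixed
frozen_mean = list(norm.mean)
eval_out = norm([105.0, 0.015])
assert norm.mean == frozen_mean  # the evaluation sample did not shift the statistics
```

With the flag off, evaluation observations are scaled by the statistics learned during training and leave them unchanged.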

For reward normalization, the wrapper tracks the discounted return:

$$G_t = r_t + \gamma G_{t-1}$$

and normalizes using the running variance of these returns:

$$\hat{r}_t = \frac{r_t}{\sqrt{\operatorname{Var}(G) + \epsilon}}$$
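
A rough sketch of this reward scaling follows, assuming a single environment and omitting episode-boundary handling. The class name `RewardScaler` and its initial values are illustrative assumptions, not the wrapper's real interface.

```python
import math


class RewardScaler:
    """Hypothetical sketch: divide rewards by the running std of discounted returns."""

    def __init__(self, gamma=0.99, epsilon=1e-8):
        self.gamma = gamma
        self.epsilon = epsilon
        self.ret = 0.0  # running discounted return G_{t-1}
        self.mean, self.var, self.count = 0.0, 1.0, 0  # assumed starting state

    def __call__(self, reward):
        # G_t = r_t + gamma * G_{t-1}
        self.ret = reward + self.gamma * self.ret
        # Update the running mean and variance of the returns, one sample at a time.
        self.count += 1
        old_mean = self.mean
        self.mean = old_mean + (self.ret - old_mean) / self.count
        self.var = ((self.count - 1) * self.var
                    + (self.ret - old_mean) * (self.ret - self.mean)) / self.count
        # The reward is rescaled only; no mean is subtracted.
        return reward / math.sqrt(self.var + self.epsilon)


scaler = RewardScaler(gamma=0.5)
outputs = [scaler(r) for r in [1.0, 2.0, 3.0]]
# Discounted returns seen by the estimator: 1.0, 2.5, 4.25
```

Dividing by the standard deviation of returns rather than of rewards keeps the scale of the value targets roughly constant, which is the quantity the value function has to fit.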

For merging batch statistics (used in vector environments), the parallel algorithm combines two sets of moments:

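def update_mean_var_count(mean, var, count, batch_mean, batch_var, batch_count):
    """Merge running moments (mean, var, count) with the moments of a new batch."""
    delta = batch_mean - mean
    total_count = count + batch_count
    # Shift the mean toward the batch mean, weighted by the batch's share of samples.
    new_mean = mean + delta * batch_count / total_count
    # Sums of squared deviations of each set about its own mean.
    m_a = var * count
    m_b = batch_var * batch_count
    # The cross term accounts for the gap between the two means.
    M2 = m_a + m_b + delta**2 * count * batch_count / total_count
    new_var = M2 / total_count  # population (biased) variance
    return new_mean, new_var, total_count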
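
As a sanity check, merging the moments of two batches should give the same result as computing the statistics over their concatenation. The sketch below repeats the merge rule in compact form so it runs on its own; the batch values are made up for illustration.

```python
import statistics


def update_mean_var_count(mean, var, count, batch_mean, batch_var, batch_count):
    # Same merge rule as the function above, repeated so this snippet is self-contained.
    delta = batch_mean - mean
    total_count = count + batch_count
    new_mean = mean + delta * batch_count / total_count
    M2 = var * count + batch_var * batch_count + delta**2 * count * batch_count / total_count
    return new_mean, M2 / total_count, total_count


# Two batches of a scalar observation, e.g. collected from parallel environments.
batch_a = [1.0, 2.0, 3.0, 4.0]
batch_b = [10.0, 12.0, 14.0]

mean, var, count = statistics.fmean(batch_a), statistics.pvariance(batch_a), len(batch_a)
mean, var, count = update_mean_var_count(
    mean, var, count,
    statistics.fmean(batch_b), statistics.pvariance(batch_b), len(batch_b),
)

combined = batch_a + batch_b
assert abs(mean - statistics.fmean(combined)) < 1e-9
assert abs(var - statistics.pvariance(combined)) < 1e-9
```

With `batch_count` equal to 1 and `batch_var` equal to 0, the merge reduces to the single-sample recurrence given earlier.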
