Principle: Farama Foundation Gymnasium REINFORCE Policy Gradient
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Policy_Gradient |
| Last Updated | 2026-02-15 03:00 GMT |
Overview
A Monte Carlo policy gradient algorithm that updates a parameterized policy by ascending the gradient of expected cumulative reward using complete episode returns.
Description
REINFORCE (Williams, 1992) is the simplest policy gradient algorithm. It collects complete episodes, computes discounted returns for each timestep, and updates policy parameters in the direction that increases the probability of actions that led to high returns.
Key characteristics:
- Monte Carlo: Uses full episode returns, no bootstrapping
- On-policy: Data must come from the current policy
- High variance: Mitigated by subtracting a baseline (typically a value function)
- Unbiased: The Monte Carlo gradient estimate is unbiased, unlike bootstrapped (TD-based) estimates
REINFORCE is foundational for understanding policy gradient methods. More advanced algorithms (A2C, PPO, TRPO) build upon its gradient estimate with variance reduction and trust region constraints.
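The core update described above can be sketched in plain NumPy. This is a minimal illustration, not a reference implementation: the linear-softmax policy, the `(state_features, action, reward)` episode format, and the learning rate are all assumptions made for the example.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute the Monte Carlo return G_t for every timestep of one complete episode."""
    G = 0.0
    returns = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G  # G_t = r_t + gamma * G_{t+1}
        returns[t] = G
    return returns

def softmax(x):
    z = x - x.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(theta, episode, gamma=0.99, lr=0.01):
    """One REINFORCE update from a single complete episode.

    theta:   (n_features, n_actions) weights of a linear-softmax policy (illustrative choice)
    episode: list of (state_features, action, reward) tuples
    """
    states, actions, rewards = zip(*episode)
    returns = discounted_returns(list(rewards), gamma)
    grad = np.zeros_like(theta)
    for s, a, G in zip(states, actions, returns):
        probs = softmax(theta.T @ s)
        # grad of log pi(a|s) for a linear-softmax policy: s (e_a - probs)^T
        dlog = np.outer(s, -probs)
        dlog[:, a] += s
        grad += G * dlog  # weight the score function by the return
    return theta + lr * grad  # gradient ASCENT on expected return
```

Note that the weights move in the direction of the score function scaled by the return, so actions followed by high returns become more probable.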
Usage
Use REINFORCE for simple continuous or discrete control tasks, educational purposes, or as a baseline. For complex tasks requiring sample efficiency, prefer A2C or PPO with GAE.
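Because REINFORCE is on-policy and Monte Carlo, each update needs complete episodes collected under the current policy. The rollout loop below follows Gymnasium's `reset()`/`step()` signature (`obs, reward, terminated, truncated, info`); the `TwoStepEnv` stand-in is a hypothetical toy environment used only so the sketch runs without the `gymnasium` package installed.

```python
import numpy as np

class TwoStepEnv:
    """Toy stand-in mimicking Gymnasium's reset()/step() API, so the
    rollout loop below matches real Gymnasium usage."""
    def reset(self, seed=None):
        self.t = 0
        return np.zeros(2), {}  # (observation, info)

    def step(self, action):
        self.t += 1
        obs = np.ones(2) * self.t
        reward = 1.0 if action == 0 else 0.0
        terminated = self.t >= 2  # episode ends after two steps
        return obs, reward, terminated, False, {}

def collect_episode(env, policy, rng):
    """Roll out one complete episode under the current policy (on-policy)."""
    obs, _ = env.reset()
    episode = []
    terminated = truncated = False
    while not (terminated or truncated):
        action = policy(obs, rng)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        episode.append((obs, action, reward))
        obs = next_obs
    return episode

rng = np.random.default_rng(0)
ep = collect_episode(TwoStepEnv(), lambda obs, rng: int(rng.integers(2)), rng)
```

With a real environment, `TwoStepEnv()` would be replaced by something like `gymnasium.make("CartPole-v1")` and the collected tuples fed to the policy update.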
Theoretical Basis
The policy gradient theorem gives the gradient of the expected return J(θ):

∇_θ J(θ) = E_{τ∼π_θ} [ Σ_{t=0}^{T−1} ∇_θ log π_θ(a_t | s_t) · G_t ]

Where the discounted return from timestep t is:

G_t = Σ_{k=t}^{T−1} γ^{k−t} r_{k+1}

With a baseline b(s_t) (typically a learned value function) for variance reduction:

∇_θ J(θ) = E_{τ∼π_θ} [ Σ_{t=0}^{T−1} ∇_θ log π_θ(a_t | s_t) · (G_t − b(s_t)) ]

For continuous actions, the policy is typically a Gaussian whose mean and standard deviation are produced by the parameterized policy:

π_θ(a | s) = N(a; μ_θ(s), σ_θ(s)²)
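For the Gaussian case, the score function with respect to the mean has a closed form, ∂/∂μ log N(a; μ, σ²) = (a − μ)/σ². A small sketch for a scalar action (function names are illustrative), verified against a finite-difference check:

```python
import numpy as np

def gaussian_logprob(a, mu, sigma):
    """log N(a; mu, sigma^2) for a scalar continuous action."""
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

def grad_logprob_mu(a, mu, sigma):
    """d/dmu log N(a; mu, sigma^2) = (a - mu) / sigma^2 --
    the score function that the REINFORCE update scales by the return."""
    return (a - mu) / sigma ** 2
```

In practice the gradient flows through μ_θ(s) by the chain rule, so an autodiff framework computes this term automatically from the log-probability.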