Principle: Farama Foundation Gymnasium REINFORCE Policy Gradient
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Policy_Gradient |
| Last Updated | 2026-02-15 03:00 GMT |
Overview
A Monte Carlo policy gradient algorithm that updates a parameterized policy by ascending the gradient of expected cumulative reward using complete episode returns.
Description
REINFORCE (Williams, 1992) is the simplest policy gradient algorithm. It collects complete episodes, computes discounted returns for each timestep, and updates policy parameters in the direction that increases the probability of actions that led to high returns.
Key characteristics:
- Monte Carlo: Uses full episode returns, no bootstrapping
- On-policy: Data must come from the current policy
- High variance: Mitigated by subtracting a baseline (typically a value function)
- Unbiased: The Monte Carlo gradient estimate is unbiased, unlike bootstrapped (TD-based) estimates
REINFORCE is foundational for understanding policy gradient methods. More advanced algorithms (A2C, PPO, TRPO) build upon its gradient estimate with variance reduction and trust region constraints.
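The core update described above can be sketched in plain NumPy. This is a minimal illustration, not a reference implementation: the linear-softmax policy, the `(state_features, action, reward)` episode format, and the learning rate are all assumptions made for the example.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute the Monte Carlo return G_t for every timestep of one complete episode."""
    G = 0.0
    returns = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G  # G_t = r_t + gamma * G_{t+1}
        returns[t] = G
    return returns

def softmax(x):
    z = x - x.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(theta, episode, gamma=0.99, lr=0.01):
    """One REINFORCE update from a single complete episode.

    theta:   (n_features, n_actions) weights of a linear-softmax policy (illustrative choice)
    episode: list of (state_features, action, reward) tuples
    """
    states, actions, rewards = zip(*episode)
    returns = discounted_returns(list(rewards), gamma)
    grad = np.zeros_like(theta)
    for s, a, G in zip(states, actions, returns):
        probs = softmax(theta.T @ s)
        # grad of log pi(a|s) for a linear-softmax policy: s (e_a - probs)^T
        dlog = np.outer(s, -probs)
        dlog[:, a] += s
        grad += G * dlog  # weight the score function by the return
    return theta + lr * grad  # gradient ASCENT on expected return
```

Note that the weights move in the direction of the score function scaled by the return, so actions followed by high returns become more probable.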
Usage
Use REINFORCE for simple continuous or discrete control tasks, educational purposes, or as a baseline. For complex tasks requiring sample efficiency, prefer A2C or PPO with GAE.
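Because REINFORCE is on-policy and Monte Carlo, each update needs complete episodes collected under the current policy. The rollout loop below follows Gymnasium's `reset()`/`step()` signature (`obs, reward, terminated, truncated, info`); the `TwoStepEnv` stand-in is a hypothetical toy environment used only so the sketch runs without the `gymnasium` package installed.

```python
import numpy as np

class TwoStepEnv:
    """Toy stand-in mimicking Gymnasium's reset()/step() API, so the
    rollout loop below matches real Gymnasium usage."""
    def reset(self, seed=None):
        self.t = 0
        return np.zeros(2), {}  # (observation, info)

    def step(self, action):
        self.t += 1
        obs = np.ones(2) * self.t
        reward = 1.0 if action == 0 else 0.0
        terminated = self.t >= 2  # episode ends after two steps
        return obs, reward, terminated, False, {}

def collect_episode(env, policy, rng):
    """Roll out one complete episode under the current policy (on-policy)."""
    obs, _ = env.reset()
    episode = []
    terminated = truncated = False
    while not (terminated or truncated):
        action = policy(obs, rng)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        episode.append((obs, action, reward))
        obs = next_obs
    return episode

rng = np.random.default_rng(0)
ep = collect_episode(TwoStepEnv(), lambda obs, rng: int(rng.integers(2)), rng)
```

With a real environment, `TwoStepEnv()` would be replaced by something like `gymnasium.make("CartPole-v1")` and the collected tuples fed to the policy update.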
Theoretical Basis
The policy gradient theorem gives the gradient of the expected return J(θ):

∇_θ J(θ) = E_{τ∼π_θ} [ Σ_{t=0}^{T−1} ∇_θ log π_θ(a_t | s_t) · G_t ]

Where the discounted return from timestep t is:

G_t = Σ_{k=t}^{T−1} γ^{k−t} r_{k+1}

With a baseline b(s_t) (typically a learned value function) for variance reduction:

∇_θ J(θ) = E_{τ∼π_θ} [ Σ_{t=0}^{T−1} ∇_θ log π_θ(a_t | s_t) · (G_t − b(s_t)) ]

For continuous actions, the policy is typically a Gaussian whose mean and standard deviation are produced by the parameterized policy:

π_θ(a | s) = N(a; μ_θ(s), σ_θ(s)²)
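For the Gaussian case, the score function with respect to the mean has a closed form, ∂/∂μ log N(a; μ, σ²) = (a − μ)/σ². A small sketch for a scalar action (function names are illustrative), verified against a finite-difference check:

```python
import numpy as np

def gaussian_logprob(a, mu, sigma):
    """log N(a; mu, sigma^2) for a scalar continuous action."""
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

def grad_logprob_mu(a, mu, sigma):
    """d/dmu log N(a; mu, sigma^2) = (a - mu) / sigma^2 --
    the score function that the REINFORCE update scales by the return."""
    return (a - mu) / sigma ** 2
```

In practice the gradient flows through μ_θ(s) by the chain rule, so an autodiff framework computes this term automatically from the log-probability.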