

Principle:Farama Foundation Gymnasium REINFORCE Policy Gradient

From Leeroopedia
Knowledge Sources
Domains Reinforcement_Learning, Policy_Gradient
Last Updated 2026-02-15 03:00 GMT

Overview

A Monte Carlo policy gradient algorithm that updates a parameterized policy by ascending the gradient of expected cumulative reward using complete episode returns.

Description

REINFORCE (Williams, 1992) is the simplest policy gradient algorithm. It collects complete episodes, computes discounted returns for each timestep, and updates policy parameters in the direction that increases the probability of actions that led to high returns.

Key characteristics:

  • Monte Carlo: Uses full episode returns, no bootstrapping
  • On-policy: Data must come from the current policy
  • High variance: Mitigated by subtracting a baseline (typically a value function)
  • Unbiased: The Monte Carlo estimate is an unbiased estimator of the true policy gradient

REINFORCE is foundational for understanding policy gradient methods. More advanced algorithms (A2C, PPO, TRPO) build upon its gradient estimate with variance reduction and trust region constraints.
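The update described above can be sketched in a few lines of numpy. The following is an illustrative sketch on an assumed toy one-step environment (a two-armed bandit where action 1 pays reward 1), not a Gymnasium-specific implementation; the learning rate and episode count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(action):
    # Toy one-step "environment" (assumption): action 1 yields reward 1,
    # action 0 yields reward 0, and every episode has length 1.
    return float(action == 1)

theta = np.zeros(2)   # policy parameters: one logit per action
alpha = 0.1           # learning rate (illustrative choice)

for episode in range(500):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    G = step(a)                      # return of this one-step episode
    # For a softmax policy, grad_theta log pi(a) = one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += alpha * G * grad_log_pi  # gradient ascent on J(theta)

print(softmax(theta))  # probability mass should concentrate on action 1
```

After training, the policy assigns most of its probability to the rewarding action, which is the essence of the REINFORCE update: actions followed by high returns become more likely.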

Usage

Use REINFORCE for simple continuous or discrete control tasks, educational purposes, or as a baseline. For complex tasks requiring sample efficiency, prefer A2C or PPO with GAE.

Theoretical Basis

The policy gradient theorem:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t \right]

Where the discounted return is:

G_t = \sum_{k=0}^{T-t} \gamma^k \, r_{t+k}
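Computed naively, the sum above costs O(T) per timestep; a standard backward recursion computes all returns in a single O(T) pass. A minimal sketch:

```python
def discounted_returns(rewards, gamma=0.99):
    # G_t = r_t + gamma * G_{t+1}, computed backwards over the episode.
    G = 0.0
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        out[t] = G
    return out
```

For example, `discounted_returns([1, 1, 1], gamma=0.5)` gives `[1.75, 1.5, 1.0]`.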

With a baseline b(s_t) for variance reduction:

\nabla_\theta J(\theta) = \mathbb{E}\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \big(G_t - b(s_t)\big) \right]
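A simple state-independent baseline, used here for illustration, is the mean return over the batch; because the baseline does not depend on the action, subtracting it leaves the gradient estimate unbiased while shrinking its variance. A minimal sketch:

```python
import numpy as np

def advantages(returns):
    # Subtract a constant baseline b = mean(G_t). This leaves the
    # policy-gradient estimate unbiased but reduces its variance.
    G = np.asarray(returns, dtype=float)
    return G - G.mean()
```

In practice a learned value function b(s_t) = V(s_t) is a stronger baseline, but the mean-return version already centers the returns around zero.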

For continuous actions, the policy is typically a Gaussian:

\pi_\theta(a \mid s) = \mathcal{N}\big(\mu_\theta(s), \sigma_\theta(s)\big)

Related Pages

Implemented By
