Principle:LaurentMazare Tch rs REINFORCE Policy Gradient

From Leeroopedia


Knowledge Sources
Domains: Reinforcement Learning, Deep Learning
Last Updated: 2026-02-08 00:00 GMT

Overview

REINFORCE is a Monte Carlo policy gradient algorithm that updates a parameterized policy along the gradient of the log-probability of each sampled action, weighted by the empirical return that followed it.

Description

REINFORCE (also known as the Monte Carlo policy gradient method) is one of the earliest and most fundamental policy gradient algorithms. It directly optimizes a parameterized policy π(a|s;θ) without requiring a model of the environment's dynamics.

The algorithm operates as follows:

  • Trajectory sampling: The agent interacts with the environment for a complete episode, recording the sequence of states, actions, and rewards. This produces a trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, …, s_T).
  • Return computation: For each time step t in the trajectory, the discounted return G_t is computed as the sum of discounted future rewards from that step onward. This is computed backward from the end of the episode for efficiency.
  • Policy gradient estimation: The gradient of the expected return with respect to policy parameters is estimated using the log-probability trick. The key insight is that the gradient of the expected return can be written as an expectation of the product of the return and the gradient of the log-policy, which can be estimated from sampled trajectories.
  • Parameter update: The policy parameters are updated in the direction of the estimated gradient, scaled by a learning rate. Actions followed by high returns are reinforced more strongly than actions followed by low returns, and actions followed by negative returns (or returns below a baseline) become less probable.

A critical property of REINFORCE is that its gradient estimate is unbiased but has high variance, because each return depends on every subsequent action and reward in the trajectory. Common variance reduction techniques include subtracting a baseline (e.g., the average return) and reward normalization.
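
As an illustration of the normalization idea, here is a minimal sketch that standardizes the returns of a single episode. The function name `normalize_returns` and the `eps` parameter are hypothetical, chosen for this example. Note that this is a common heuristic rather than a formally unbiased baseline: the episode mean depends on the sampled actions, and dividing by the standard deviation rescales the gradient.

```python
import statistics

def normalize_returns(returns, eps=1e-8):
    """Shift returns to zero mean and scale to roughly unit standard deviation."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)  # population std; 0.0 for a constant episode
    # eps (an assumed small constant) avoids division by zero when std == 0.
    return [(g - mean) / (std + eps) for g in returns]

normalized = normalize_returns([1.5, 1.0, 2.0])
print([round(x, 3) for x in normalized])
```

After normalization, below-average returns become negative, so the corresponding actions are pushed down rather than merely pushed up less.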

Usage

REINFORCE is applied in environments where only episodic interaction is available, as a baseline algorithm for policy gradient research, in settings with discrete action spaces (e.g., game playing), and as a pedagogical introduction to policy optimization methods.

Theoretical Basis

Objective:

Maximize the expected return under the policy:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \gamma^{t} r_t \right]$$

Policy Gradient Theorem:

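$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi(a_t|s_t;\theta)\, G_t \right]$$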

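where the discounted return from step t is:

$$G_t = \sum_{k=0}^{T-t} \gamma^{k} r_{t+k}$$

Strictly, differentiating the discounted objective J(θ) weights the step-t term by an additional factor γ^t. The pseudocode below keeps this factor, while many practical implementations drop it.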
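
The return definition can be evaluated in a single backward pass using the recursion G_t = r_t + γ G_{t+1}. The following is a minimal sketch. The function name `discounted_returns` is hypothetical, and the episode is assumed to be given as a plain list of rewards r_0, …, r_{T-1}.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] for one episode via G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0  # plays the role of the terminal value G_T = 0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Worked example: rewards (1, 0, 2) with gamma = 0.5
# G_2 = 2, G_1 = 0 + 0.5 * 2 = 1, G_0 = 1 + 0.5 * 1 = 1.5
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

The backward pass costs O(T) per episode, whereas evaluating the sum independently for every t costs O(T²).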

Log-Probability Trick (Score Function Estimator):

The derivation relies on the identity:

$$\nabla_\theta \pi(a|s;\theta) = \pi(a|s;\theta)\, \nabla_\theta \log \pi(a|s;\theta)$$

which allows rewriting the gradient of an expectation as an expectation of a product.
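
To make the rewriting step explicit, here is a short derivation for a single state, assuming a finite action set and a function f(a) that does not depend on θ (both are simplifying assumptions made for this sketch):

```latex
\begin{aligned}
\nabla_\theta \, \mathbb{E}_{a \sim \pi(\cdot|s;\theta)}[f(a)]
  &= \nabla_\theta \sum_a \pi(a|s;\theta)\, f(a) \\
  &= \sum_a \nabla_\theta \pi(a|s;\theta)\, f(a) \\
  &= \sum_a \pi(a|s;\theta)\, \nabla_\theta \log \pi(a|s;\theta)\, f(a) \\
  &= \mathbb{E}_{a \sim \pi(\cdot|s;\theta)}\big[ f(a)\, \nabla_\theta \log \pi(a|s;\theta) \big]
\end{aligned}
```

The third line applies the identity term by term. The final expectation can be approximated by averaging over sampled actions, which is what makes the estimator usable without knowing the environment dynamics.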

REINFORCE Algorithm:

initialize policy parameters theta
for each episode:
    generate trajectory (s_0, a_0, r_0, ..., s_T) using pi(theta)
    for t = T-1 down to 0:
        G_t := r_t + gamma * G_{t+1}    (with G_T = 0)
    for t = 0 to T-1:
        theta := theta + alpha * gamma^t * G_t * grad(log pi(a_t | s_t; theta))
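
The pseudocode above can be sketched in plain Python for a tabular softmax policy. Everything environment-specific here is an assumption made for illustration: the four-state corridor, the `step` dynamics, the reward of 1 at the goal, and the hyperparameters. For a softmax over logits theta[s], the gradient of log π(a|s) with respect to theta[s][b] is 1[a = b] − π(b|s), which is what the update uses.

```python
import math
import random

N_STATES, N_ACTIONS, GOAL = 4, 2, 3  # hypothetical toy corridor: states 0..3

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def step(state, action):
    """Assumed dynamics: action 1 moves right, action 0 moves left (clamped at 0)."""
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def run_episode(theta, rng, max_steps=20):
    """Sample one trajectory as a list of (state, action, reward) tuples."""
    state, trajectory = 0, []
    for _ in range(max_steps):
        probs = softmax(theta[state])
        action = rng.choices(range(N_ACTIONS), weights=probs)[0]
        next_state, reward, done = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory

def reinforce_update(theta, trajectory, alpha, gamma):
    """Apply the per-step REINFORCE update in place."""
    g, returns = 0.0, []
    for _, _, r in reversed(trajectory):  # backward pass: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    for t, ((s, a, _), g_t) in enumerate(zip(trajectory, returns)):
        probs = softmax(theta[s])
        for b in range(N_ACTIONS):
            grad_log = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += alpha * (gamma ** t) * g_t * grad_log

def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        reinforce_update(theta, run_episode(theta, rng), alpha, gamma)
    return theta

trained = train()
# Probability of moving right in each non-terminal state after training.
print([round(softmax(row)[1], 3) for row in trained[:GOAL]])
```

In this toy setting the probability of moving right would be expected to rise during training, though the outcome of any single run depends on the seed and step size, which illustrates the variance issue discussed above. Episodes that time out without reaching the goal have all-zero returns and therefore produce no update.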

Variance Reduction with Baseline:

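Subtracting a state-dependent baseline b(s_t) from the return does not change the expected gradient but reduces variance:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi(a_t|s_t;\theta)\, \big( G_t - b(s_t) \big) \right]$$

The optimal baseline is

$$b^{*}(s) = \frac{\mathbb{E}\left[ \lVert \nabla_\theta \log \pi \rVert^{2}\, G \right]}{\mathbb{E}\left[ \lVert \nabla_\theta \log \pi \rVert^{2} \right]}$$

though in practice a running average of returns or a learned value function is used.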
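
The claim that a baseline leaves the expected gradient unchanged rests on the fact that the score function has zero mean under the policy, so the term b(s) · E[∇ log π(a|s;θ)] vanishes. The following sketch checks this numerically for a softmax policy over a few logits. The function names and the example logits are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_score(logits):
    """Compute E_{a ~ pi}[grad_theta log pi(a)] exactly for a softmax policy."""
    probs = softmax(logits)
    n = len(logits)
    expected = [0.0] * n
    for a in range(n):          # sum over actions, weighted by pi(a)
        for b in range(n):      # component b of the gradient of log pi(a)
            grad_log = (1.0 if a == b else 0.0) - probs[b]
            expected[b] += probs[a] * grad_log
    return expected

# Every component is zero up to floating-point rounding.
print(expected_score([0.2, -1.0, 3.0]))
```

Because the expectation is zero for any logits, multiplying it by any quantity that does not depend on the sampled action, such as b(s_t), adds nothing to the expected gradient.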
