Principle:LaurentMazare Tch rs REINFORCE Policy Gradient

Knowledge Sources	LaurentMazare_Tch_rs, Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
Domains	Reinforcement Learning, Deep Learning
Last Updated	2026-02-08 00:00 GMT

Overview

REINFORCE is a Monte Carlo policy gradient algorithm that updates a parameterized policy along the gradient of the log-probability of each sampled action, weighted by the empirical return that followed that action.

Description

REINFORCE (also known as the Monte Carlo policy gradient method) is one of the earliest and most fundamental policy gradient algorithms. It directly optimizes a parameterized policy $π (a | s; θ)$ without requiring a model of the environment's dynamics.

The algorithm operates as follows:

Trajectory sampling: The agent interacts with the environment for a complete episode, recording the sequence of states, actions, and rewards. This produces a trajectory $τ = (s_{0}, a_{0}, r_{0}, s_{1}, a_{1}, r_{1}, \dots, s_{T})$ .

Return computation: For each time step $t$ in the trajectory, the discounted return $G_{t}$ is computed as the sum of discounted future rewards from that step onward. This is computed backward from the end of the episode for efficiency.

Policy gradient estimation: The gradient of the expected return with respect to policy parameters is estimated using the log-probability trick. The key insight is that the gradient of the expected return can be written as an expectation of the product of the return and the gradient of the log-policy, which can be estimated from sampled trajectories.

Parameter update: The policy parameters are updated in the direction of the estimated gradient, scaled by a learning rate. Actions followed by high returns gain probability relative to actions followed by low returns.

A critical property of REINFORCE is that its gradient estimate is unbiased, but the estimate suffers from high variance because each return depends on the entire remaining trajectory. Common variance reduction techniques include subtracting a baseline (e.g., the average return) and normalizing returns.
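As a minimal sketch of the return normalization mentioned above, the helper below shifts one episode's returns to zero mean and rescales them to unit standard deviation. The names `normalize_returns` and `eps` are illustrative assumptions, not taken from tch-rs. Note that this is a heuristic: dividing by the standard deviation rescales the gradient rather than preserving it exactly.

```python
import statistics


def normalize_returns(returns, eps=1e-8):
    """Shift returns to zero mean and scale them to roughly unit variance."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)  # population standard deviation
    # eps (an assumed small constant) avoids division by zero
    # when every return in the episode is identical.
    return [(g - mean) / (std + eps) for g in returns]
```

After normalization, above-average returns are positive and below-average returns are negative, so the corresponding actions are pushed in opposite directions.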

Usage

REINFORCE suits environments where only episodic interaction is available and settings with discrete action spaces (e.g., game playing). It also serves as a baseline algorithm for policy gradient research and as a pedagogical introduction to policy optimization methods.

Theoretical Basis

Objective:

Maximize the expected return under the policy:

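$J (θ) = 𝔼_{τ \sim π_{θ}} [\sum_{t = 0}^{T - 1} γ^{t} r_{t}]$

Policy Gradient Theorem:

$\nabla_{θ} J (θ) = 𝔼_{τ \sim π_{θ}} [\sum_{t = 0}^{T - 1} γ^{t} G_{t} \nabla_{θ} \log π (a_{t} | s_{t}; θ)]$

where the discounted return from step $t$ is:

$G_{t} = \sum_{k = 0}^{T - t - 1} γ^{k} r_{t + k}$

The sums stop at $T - 1$ because the trajectory $τ$ ends in the terminal state $s_{T}$, so its last reward is $r_{T - 1}$. The factor $γ^{t}$ follows from the discounting in $J (θ)$. Many practical implementations drop it, which no longer yields an unbiased gradient of $J (θ)$ but is widely used.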
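The discounted return can be computed for every step in a single backward pass using the recursion $G_{t} = r_{t} + γ G_{t + 1}$. The following is a minimal sketch, where `discounted_returns` is an illustrative name rather than a function from tch-rs:

```python
def discounted_returns(rewards, gamma):
    """Return the discounted return from every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0  # return after the final step is zero
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns
```

For rewards `[1.0, 1.0, 1.0]` and `gamma = 0.5`, this yields `[1.75, 1.5, 1.0]`: the last step keeps only its own reward, and each earlier step adds half of the return that follows it.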

Log-Probability Trick (Score Function Estimator):

The derivation relies on the identity:

$\nabla_{θ} π (a | s; θ) = π (a | s; θ) \nabla_{θ} \log π (a | s; θ)$

which allows rewriting the gradient of an expectation as an expectation of a product.
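The identity can be checked numerically. The sketch below assumes a softmax policy over a handful of discrete actions in a single state and an arbitrary per-action payoff `f` (both are illustrative assumptions). It compares the score function form $𝔼 [f (a) \nabla_{θ} \log π (a; θ)]$, computed exactly by summing over actions, against a finite-difference gradient of $𝔼 [f (a)]$.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_value(logits, f):
    """E_{a ~ pi}[f(a)] for a softmax policy with the given logits."""
    return sum(p * fa for p, fa in zip(softmax(logits), f))


def score_function_gradient(logits, f):
    """E_{a ~ pi}[f(a) * grad log pi(a)], summed exactly over all actions."""
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a in range(n):
        for i in range(n):
            # For a softmax policy: d log pi(a) / d logit_i = 1[a == i] - pi(i)
            dlog = (1.0 if a == i else 0.0) - probs[i]
            grad[i] += probs[a] * f[a] * dlog
    return grad


def finite_difference_gradient(logits, f, h=1e-6):
    """Central-difference approximation of grad E_{a ~ pi}[f(a)]."""
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += h
        down[i] -= h
        grad.append((expected_value(up, f) - expected_value(down, f)) / (2 * h))
    return grad
```

The two gradients agree up to finite-difference error, which is what lets REINFORCE replace the exact sum over actions with an average over sampled actions.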

REINFORCE Algorithm:

initialize policy parameters theta
for each episode:
    generate trajectory (s_0, a_0, r_0, ..., s_T) using pi(theta)
    for t = T-1 down to 0:
        G_t := r_t + gamma * G_{t+1}    (with G_T = 0)
    for t = 0 to T-1:
        theta := theta + alpha * gamma^t * G_t * grad(log pi(a_t | s_t; theta))
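The pseudocode above can be made concrete with a small, self-contained sketch. It is not the tch-rs implementation: it assumes a hypothetical four-state corridor environment (start at state 0, reward 1 on reaching state 3), a tabular softmax policy, and illustrative hyperparameters. The update follows the pseudocode step by step, including the `gamma^t` factor.

```python
import math
import random

N_STATES = 4        # hypothetical corridor: states 0..3, state 3 is the goal
GOAL = N_STATES - 1
MAX_STEPS = 50      # cap on episode length (an assumption of this sketch)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def env_step(state, action):
    """Action 0 moves left, action 1 moves right; reward 1 on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def run_episode(theta, rng):
    """Sample one trajectory by following the current softmax policy."""
    states, actions, rewards = [], [], []
    state = 0
    for _ in range(MAX_STEPS):
        probs = softmax(theta[state])
        action = 0 if rng.random() < probs[0] else 1
        next_state, reward, done = env_step(state, action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:
            break
    return states, actions, rewards


def reinforce(episodes=2000, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]  # one logit per (state, action)
    for _ in range(episodes):
        states, actions, rewards = run_episode(theta, rng)
        # Backward pass: G_t = r_t + gamma * G_{t+1}, with zero after the last step.
        returns = [0.0] * len(rewards)
        g = 0.0
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g
            returns[t] = g
        # Forward pass: theta += alpha * gamma^t * G_t * grad log pi(a_t | s_t).
        for t, (s, a) in enumerate(zip(states, actions)):
            probs = softmax(theta[s])
            scale = alpha * (gamma ** t) * returns[t]
            for i in range(2):
                grad_log = (1.0 if i == a else 0.0) - probs[i]
                theta[s][i] += scale * grad_log
    return theta
```

Calling `reinforce()` returns the learned logits, and `softmax(theta[s])[1]` gives the probability of moving right in state `s`. Because moving right reaches the reward sooner, that probability should rise above its initial value of 0.5 in every non-goal state.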

Variance Reduction with Baseline:

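Subtracting a baseline $b (s_{t})$ that depends on the state but not on the action does not change the expected gradient, and a well-chosen baseline reduces variance:

$\nabla_{θ} J (θ) = 𝔼_{τ \sim π_{θ}} [\sum_{t = 0}^{T - 1} γ^{t} (G_{t} - b (s_{t})) \nabla_{θ} \log π (a_{t} | s_{t}; θ)]$

The expectation is unchanged because $𝔼_{a \sim π} [\nabla_{θ} \log π (a | s; θ)] = 0$ for every state $s$.

The variance-minimizing baseline is $b^{*} (s) = \frac{𝔼 [‖ \nabla \log π ‖^{2} G]}{𝔼 [‖ \nabla \log π ‖^{2}]}$ , where $\nabla \log π$ abbreviates $\nabla_{θ} \log π (a | s; θ)$ and $G$ is the return. In practice, a running average of returns or a learned value function is used instead.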
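The effect of a baseline can be illustrated exactly in a simplified setting. The sketch below assumes a single decision (one state, one step) with a softmax policy over discrete actions, so the return is just a per-action payoff `f`. All names are illustrative. It computes the mean and total variance of the estimator $(f (a) - b) \nabla \log π (a)$ by summing over actions, and evaluates the optimal baseline formula above.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score(probs, a):
    """grad of log pi(a) with respect to the logits of a softmax policy."""
    return [(1.0 if a == i else 0.0) - p for i, p in enumerate(probs)]


def gradient_moments(logits, f, baseline):
    """Exact mean and total variance of (f(a) - b) * grad log pi(a), a ~ pi."""
    probs = softmax(logits)
    mean = [0.0] * len(logits)
    second_moment = 0.0
    for a, p in enumerate(probs):
        sample = [(f[a] - baseline) * s for s in score(probs, a)]
        mean = [m + p * x for m, x in zip(mean, sample)]
        second_moment += p * sum(x * x for x in sample)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


def optimal_baseline(logits, f):
    """E[||grad log pi||^2 * f] / E[||grad log pi||^2] for this one-step case."""
    probs = softmax(logits)
    sq_norms = [sum(s * s for s in score(probs, a)) for a in range(len(probs))]
    numerator = sum(p * n * fa for p, n, fa in zip(probs, sq_norms, f))
    denominator = sum(p * n for p, n in zip(probs, sq_norms))
    return numerator / denominator
```

With payoffs that share a large offset, such as `f = [10.0, 11.0, 12.0]`, the mean gradient is the same for every baseline, while the variance drops sharply when moving from no baseline to the average payoff, and is lowest at the value returned by `optimal_baseline`.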

Related Pages

Implementation:LaurentMazare_Tch_rs_Policy_Gradient
