Principle:LaurentMazare Tch rs REINFORCE Policy Gradient
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning, Deep Learning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
REINFORCE is a Monte Carlo policy gradient algorithm that updates a parameterized policy by following the gradient of the log-probability of each sampled action, scaled by the empirical return that followed it.
Description
REINFORCE (also known as the Monte Carlo policy gradient method) is one of the earliest and most fundamental policy gradient algorithms. It directly optimizes a parameterized policy without requiring a model of the environment's dynamics.
The algorithm operates as follows:
- Trajectory sampling: The agent interacts with the environment for a complete episode, recording the sequence of states, actions, and rewards. This produces a trajectory $\tau = (s_0, a_0, r_0, \ldots, s_T)$.
- Return computation: For each time step $t$ in the trajectory, the discounted return $G_t$ is the sum of discounted future rewards from that step onward. The returns are computed backward from the end of the episode so that each one reuses the next: $G_t = r_t + \gamma G_{t+1}$, with $G_T = 0$.
- Policy gradient estimation: The gradient of the expected return with respect to policy parameters is estimated using the log-probability trick. The key insight is that the gradient of the expected return can be written as an expectation of the product of the return and the gradient of the log-policy, which can be estimated from sampled trajectories.
- Parameter update: The policy parameters are updated in the direction of the estimated gradient, scaled by a learning rate. Actions that led to high returns become more probable, while actions that led to low returns become less probable.
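The return computation step above can be sketched in plain Rust using only the standard library. The function name `discounted_returns` and the use of `f64` slices are assumptions made for illustration; this is not tch-rs API, and a tensor-based implementation would look different.

```rust
/// Computes discounted returns G_t = r_t + gamma * G_{t+1}, with G_T = 0,
/// in a single sweep from the last step of the episode to the first.
fn discounted_returns(rewards: &[f64], gamma: f64) -> Vec<f64> {
    let mut returns = vec![0.0; rewards.len()];
    let mut running = 0.0; // G_T = 0 past the end of the episode
    for t in (0..rewards.len()).rev() {
        running = rewards[t] + gamma * running;
        returns[t] = running;
    }
    returns
}
```

For example, `discounted_returns(&[1.0, 1.0, 1.0], 0.5)` evaluates to `[1.75, 1.5, 1.0]`. The backward sweep costs one multiply-add per step, whereas summing the future rewards separately for every step would be quadratic in the episode length.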
A critical property of REINFORCE is that its gradient estimate is unbiased, but the estimate suffers from high variance because each return depends on every random action and reward in the remainder of the trajectory. Common variance reduction techniques include subtracting a baseline (e.g., the average return) and using reward normalization.
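One common form of normalization standardizes the returns of an episode to zero mean and unit variance before they scale the gradient. The sketch below assumes that form; the name `normalize_returns` and the `1e-8` guard are illustrative choices, not taken from any particular library. Because the statistics come from the same episode they are applied to, this is a heuristic and can introduce a small bias, unlike a baseline that does not depend on the sampled actions.

```rust
/// Standardizes a batch of returns: subtract the mean, divide by the
/// (population) standard deviation. Returns an empty vector for empty input.
fn normalize_returns(returns: &[f64]) -> Vec<f64> {
    if returns.is_empty() {
        return Vec::new();
    }
    let n = returns.len() as f64;
    let mean = returns.iter().sum::<f64>() / n;
    let variance = returns.iter().map(|g| (g - mean).powi(2)).sum::<f64>() / n;
    // Small constant (an assumed value) avoids division by zero when all
    // returns in the episode are identical.
    let std_dev = variance.sqrt() + 1e-8;
    returns.iter().map(|g| (g - mean) / std_dev).collect()
}
```

Subtracting the mean plays the role of an average-return baseline, while dividing by the standard deviation keeps the update magnitude comparable across episodes with very different reward scales.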
Usage
REINFORCE fits environments where interaction is episodic, so that complete returns can be computed at the end of each episode. It commonly serves as a baseline algorithm in policy gradient research, as a method for problems with discrete action spaces (e.g., game playing), and as a pedagogical introduction to policy optimization methods.
Theoretical Basis
Objective:
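Maximize the expected return under the policy $\pi_\theta$:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$$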
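Policy Gradient Theorem:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t \, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$
where the discounted return from step $t$ is:
$$G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$$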
Log-Probability Trick (Score Function Estimator):
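The derivation relies on the identity:
$$\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s)$$
which allows rewriting the gradient of an expectation as an expectation of a product.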
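The log-derivative identity $\nabla_\theta p = p \, \nabla_\theta \log p$ can be applied to a whole trajectory to sketch the derivation. The symbols $p_\theta(\tau)$ (the probability of trajectory $\tau$ under the policy), $R(\tau)$ (the total discounted return of $\tau$), $\rho$ (the initial state distribution), and $P$ (the environment dynamics) are notation introduced here for illustration.

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta \int p_\theta(\tau) \, R(\tau) \, d\tau \\
  &= \int \nabla_\theta p_\theta(\tau) \, R(\tau) \, d\tau \\
  &= \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, R(\tau) \, d\tau \\
  &= \mathbb{E}_{\tau \sim p_\theta}\left[ R(\tau) \, \nabla_\theta \log p_\theta(\tau) \right],
\qquad
p_\theta(\tau) = \rho(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t) \, P(s_{t+1} \mid s_t, a_t).
\end{aligned}
```

Since $\rho$ and $P$ do not depend on the policy parameters, $\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, which is why no model of the dynamics is needed. Dropping the rewards received before step $t$, which the action at step $t$ cannot influence, replaces $R(\tau)$ with the per-step return and yields the per-step form of the gradient.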
REINFORCE Algorithm:
initialize policy parameters theta
for each episode:
    generate trajectory (s_0, a_0, r_0, ..., s_T) using pi(theta)
    for t = T-1 down to 0:
        G_t := r_t + gamma * G_{t+1}    (with G_T = 0)
    for t = 0 to T-1:
        theta := theta + alpha * gamma^t * G_t * grad(log pi(a_t | s_t; theta))
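The two inner loops of the pseudocode can be sketched in plain Rust for a tabular softmax policy, where the gradient of the log-policy has a closed form. This is a minimal sketch under stated assumptions: the names `softmax` and `reinforce_update`, the `theta[s][a]` logit table, and the `(state, action, reward)` tuple layout are all hypothetical, trajectory sampling is left to the caller, and none of this is tch-rs API (a tch-based version would typically rely on automatic differentiation instead).

```rust
/// Softmax over one row of logits (max-subtracted for numerical stability).
fn softmax(logits: &[f64]) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// One REINFORCE pass over a finished episode for a tabular softmax policy.
/// `theta[s][a]` is the logit of action `a` in state `s`; each episode step
/// is a (state, action, reward) tuple.
fn reinforce_update(
    theta: &mut [Vec<f64>],
    episode: &[(usize, usize, f64)],
    gamma: f64,
    alpha: f64,
) {
    // Backward sweep: G_t = r_t + gamma * G_{t+1}, with G_T = 0.
    let mut returns = vec![0.0; episode.len()];
    let mut running = 0.0;
    for t in (0..episode.len()).rev() {
        running = episode[t].2 + gamma * running;
        returns[t] = running;
    }
    // Forward sweep: theta += alpha * gamma^t * G_t * grad log pi(a_t | s_t),
    // with the gradient evaluated at the current (already updated) theta.
    let mut discount = 1.0; // holds gamma^t
    for (t, &(s, a, _)) in episode.iter().enumerate() {
        let probs = softmax(&theta[s]);
        for (i, p) in probs.iter().enumerate() {
            // Softmax policy: d log pi(a|s) / d theta[s][i] = 1[i == a] - pi(i|s).
            let indicator = if i == a { 1.0 } else { 0.0 };
            theta[s][i] += alpha * discount * returns[t] * (indicator - p);
        }
        discount *= gamma;
    }
}
```

Starting from `theta = [[0.0, 0.0]]`, a one-step episode `[(0, 1, 1.0)]` with `alpha = 0.1` moves the logits to `[-0.05, 0.05]`, so the rewarded action becomes more probable.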
Variance Reduction with Baseline:
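Subtracting a state-dependent baseline $b(s_t)$ from the return does not change the expected gradient, and a well-chosen baseline reduces variance:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t \, \bigl(G_t - b(s_t)\bigr) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$
A standard, near-optimal choice of baseline is the state-value function $b(s_t) = V^{\pi_\theta}(s_t)$, though in practice a running average of returns or a learned value function is used.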
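The effect of a baseline can be checked exactly in a small case. The sketch below assumes a one-step problem with a single state and a softmax policy, so the baseline is a constant and the expectation over actions can be enumerated instead of sampled. The names `softmax` and `estimator_stats` are hypothetical, and the code is plain Rust rather than tch-rs API.

```rust
/// Softmax over a row of logits (max-subtracted for numerical stability).
fn softmax(logits: &[f64]) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Exact mean and total variance of the single-sample gradient estimator
/// (r_a - b) * grad log pi(a), enumerating every action a with its
/// probability under the softmax policy.
fn estimator_stats(logits: &[f64], rewards: &[f64], baseline: f64) -> (Vec<f64>, f64) {
    let probs = softmax(logits);
    let n = logits.len();
    let mut mean = vec![0.0; n];
    let mut second_moment = 0.0; // E[ ||g||^2 ]
    for a in 0..n {
        let weight = rewards[a] - baseline;
        for i in 0..n {
            // Softmax policy: d log pi(a) / d logit_i = 1[i == a] - pi(i).
            let indicator = if i == a { 1.0 } else { 0.0 };
            let g = weight * (indicator - probs[i]);
            mean[i] += probs[a] * g;
            second_moment += probs[a] * g * g;
        }
    }
    let mean_sq: f64 = mean.iter().map(|m| m * m).sum();
    // Total variance = E[ ||g||^2 ] - ||E[g]||^2.
    (mean, second_moment - mean_sq)
}
```

With uniform logits `[0.0, 0.0, 0.0]` and rewards `[10.0, 11.0, 12.0]`, calling `estimator_stats` with baselines `0.0` and `11.0` gives the same mean gradient in both cases, while the total variance is far smaller with the baseline. The mean is unchanged because the baseline term sums to $b \sum_a \nabla_\theta \pi_\theta(a) = b \, \nabla_\theta 1 = 0$.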