
Principle:OpenRLHF KL Divergence Estimation

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Loss_Functions
Last Updated 2026-02-07 00:00 GMT

Overview

An approximation technique for estimating KL divergence between policy distributions from log-probability samples, used as a penalty in RLHF training.

Description

KL Divergence Estimation computes an approximate KL divergence between the current policy and a reference policy using only sampled log-probabilities (no full distribution access needed). This is used as a regularization penalty in PPO/GRPO training to prevent the policy from diverging too far from the reference model, maintaining response quality.

Three estimators (k1, k2, k3) are available, each with a different bias-variance tradeoff.

Usage

Used in PPO and Math-GRPO workflows during reward computation. The KL penalty is added to the per-token reward signal.
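A minimal sketch of how the penalty enters the reward, assuming (batch, seq_len) tensors and using the simple k1 log-ratio estimator; the function name, signature, and coefficient value are illustrative, not OpenRLHF's actual API:

```python
import torch

def penalized_rewards(rewards: torch.Tensor,
                      log_probs: torch.Tensor,
                      ref_log_probs: torch.Tensor,
                      kl_coef: float = 0.1) -> torch.Tensor:
    """Apply a per-token KL penalty to the reward signal.

    All arguments are (batch, seq_len) tensors of rewards and sampled
    token log-probabilities under the current and reference policies.
    """
    # k1 estimate per token, clamped for numerical stability
    kl = (log_probs - ref_log_probs).clamp(-10.0, 10.0)
    # the penalty is subtracted: higher divergence lowers the reward
    return rewards - kl_coef * kl
```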

Theoretical Basis

k1 estimator (standard): simple log-ratio
\[ \mathrm{KL}_{k1} = \log\frac{\pi_\theta(a|s)}{\pi_{\mathrm{ref}}(a|s)} \]

k2 estimator (non-negative): squared log-ratio
\[ \mathrm{KL}_{k2} = \frac{1}{2}\left(\log\frac{\pi_\theta(a|s)}{\pi_{\mathrm{ref}}(a|s)}\right)^2 \]

k3 estimator (non-negative, Schulman):
\[ \mathrm{KL}_{k3} = \frac{\pi_{\mathrm{ref}}(a|s)}{\pi_\theta(a|s)} - 1 - \log\frac{\pi_{\mathrm{ref}}(a|s)}{\pi_\theta(a|s)} \]

All estimators are clamped to [-10, 10] for numerical stability.
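The three estimators above can be sketched as a single function operating on sampled log-probabilities; the function name and signature are illustrative, not OpenRLHF's actual API:

```python
import torch

def estimate_kl(log_probs: torch.Tensor,
                ref_log_probs: torch.Tensor,
                estimator: str = "k1",
                clamp: float = 10.0) -> torch.Tensor:
    """Approximate per-token KL(pi_theta || pi_ref) from sampled log-probs.

    Only the log-probabilities of the sampled tokens are needed; no
    access to the full distributions is required.
    """
    # log ratio: log pi_theta(a|s) - log pi_ref(a|s)
    log_ratio = log_probs - ref_log_probs
    if estimator == "k1":
        kl = log_ratio                 # unbiased, but can be negative
    elif estimator == "k2":
        kl = 0.5 * log_ratio ** 2      # non-negative, biased
    elif estimator == "k3":
        # Schulman's estimator: r - 1 - log r with r = pi_ref / pi_theta,
        # i.e. r = exp(-log_ratio); non-negative and low-variance
        kl = torch.exp(-log_ratio) - 1 + log_ratio
    else:
        raise ValueError(f"unknown estimator: {estimator}")
    return kl.clamp(-clamp, clamp)     # clamp to [-10, 10] by default
```

Note that when the two policies agree on a token (log ratio of zero), all three estimators return exactly zero.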

Related Pages

Implemented By
