Principle: OpenRLHF KL Divergence Estimation
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Loss_Functions |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
An approximation technique for estimating KL divergence between policy distributions from log-probability samples, used as a penalty in RLHF training.
Description
KL Divergence Estimation computes an approximate KL divergence between the current policy and a reference policy using only sampled log-probabilities (no full distribution access needed). This is used as a regularization penalty in PPO/GRPO training to prevent the policy from diverging too far from the reference model, maintaining response quality.
Three estimators are available with different bias-variance tradeoffs.
Usage
Used in PPO and Math-GRPO workflows during reward computation. The KL penalty is subtracted from the per-token reward signal, so tokens that drift from the reference policy are penalized.
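The per-token penalty described above can be sketched as follows. This is an illustrative helper, not OpenRLHF's actual API; the function name, argument names, and the `kl_coef` default are assumptions.

```python
import numpy as np

def reward_with_kl_penalty(rewards, log_probs, ref_log_probs, kl_coef=0.05):
    """Subtract a per-token KL penalty (k1 estimator) from the reward.

    Hypothetical sketch: names and the kl_coef default are assumptions,
    not OpenRLHF's exact interface.
    """
    log_ratio = log_probs - ref_log_probs       # per-token log pi(x)/pi_ref(x)
    kl = np.clip(log_ratio, -10.0, 10.0)        # clamp for numerical stability
    return rewards - kl_coef * kl               # penalize divergence from reference
```

Tokens whose current log-probability exceeds the reference's are penalized; tokens the reference prefers receive a small bonus.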
Theoretical Basis
k1 estimator (standard): the simple log-ratio, k1 = log π(x) − log π_ref(x). Unbiased but high variance; individual samples can be negative.
k2 estimator (non-negative): half the squared log-ratio, k2 = (log π(x) − log π_ref(x))² / 2. Always non-negative and lower variance, but biased.
k3 estimator (non-negative, Schulman): k3 = e^(−r) − 1 + r, where r = log π(x) − log π_ref(x). Unbiased, non-negative (since e^x ≥ 1 + x), and low variance.
All estimators are clamped to [-10, 10] for numerical stability.
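The three estimators can be sketched from per-token log-probabilities alone. This is a minimal illustration of the formulas above, not OpenRLHF's exact function; the name `approx_kl` and its signature are assumptions.

```python
import numpy as np

def approx_kl(log_probs, ref_log_probs, estimator="k1"):
    """Sample-based KL estimators computed from log-probabilities only.

    Illustrative sketch; the function name and signature are assumptions.
    """
    r = log_probs - ref_log_probs       # r = log pi(x) - log pi_ref(x)
    if estimator == "k1":
        kl = r                          # unbiased, high variance
    elif estimator == "k2":
        kl = r ** 2 / 2                 # non-negative, biased, lower variance
    elif estimator == "k3":
        kl = np.expm1(-r) + r           # e^{-r} - 1 + r: non-negative, unbiased
    else:
        raise ValueError(f"unknown estimator: {estimator}")
    return np.clip(kl, -10.0, 10.0)     # clamp for numerical stability
```

When the policies agree (r = 0), all three estimators return exactly zero; as r grows, k3 stays non-negative while k1 can go negative for r < 0.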