Implementation:OpenRLHF OpenRLHF Compute approx kl

Knowledge Sources	OpenRLHF
Domains	Reinforcement_Learning, Loss_Functions
Last Updated	2026-02-07 00:00 GMT

Overview

A concrete tool, provided by OpenRLHF, for approximating KL divergence from sampled log-probabilities.

Description

The compute_approx_kl function estimates the KL divergence between two distributions from their log-probabilities at sampled points alone, so it never needs the full distributions. It supports three estimators (k1, k2, k3) with different bias-variance trade-offs and clamps the result to [-10, 10] for numerical stability.
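The k1, k2, and k3 names are commonly associated with John Schulman's note on approximating KL divergence. The sketch below restates those three formulas for a single pair of log-probabilities in plain Python. It illustrates the formulas and is not the OpenRLHF source: the helper name `approx_kl_scalar` is hypothetical, and it assumes the log-ratio is taken as policy minus reference and that the clamp is applied to the final estimate.

```python
import math


def approx_kl_scalar(log_prob: float, log_prob_base: float, kl_estimator: str = "k1") -> float:
    """Hypothetical scalar illustration of the k1/k2/k3 estimators."""
    # log(pi(x) / pi_ref(x)) for one token x sampled from the policy pi
    log_ratio = log_prob - log_prob_base

    if kl_estimator == "k1":
        # Plain log-ratio: unbiased, but individual values can be negative.
        value = log_ratio
    elif kl_estimator == "k2":
        # Half the squared log-ratio: biased, never negative.
        value = 0.5 * log_ratio ** 2
    elif kl_estimator == "k3":
        # (r - 1) - log(r) with r = pi_ref(x) / pi(x): unbiased, never negative.
        value = math.exp(-log_ratio) - 1.0 + log_ratio
    else:
        # Assumption of this sketch: reject unknown estimator names explicitly.
        raise ValueError(f"unknown kl_estimator: {kl_estimator!r}")

    # Clamp to [-10, 10], mirroring the range stated in the description.
    return max(-10.0, min(10.0, value))


if __name__ == "__main__":
    for name in ("k1", "k2", "k3"):
        print(name, approx_kl_scalar(-1.0, -1.5, name))
```

Only k1 can return a negative per-token value; k2 and k3 trade that property for a non-negative estimate.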

Usage

Called during PPO experience generation to compute the KL penalty between the current policy and the reference model.
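One common way a per-token KL estimate enters PPO is as a negative per-token reward scaled by a coefficient, with the scalar sequence reward added at the final token. The sketch below shows that pattern in plain Python. It is a minimal illustration under those assumptions: the function name `kl_penalized_rewards` and the `kl_coef` default are hypothetical and are not taken from the OpenRLHF source.

```python
from typing import List


def kl_penalized_rewards(
    kl_per_token: List[float],
    sequence_reward: float,
    kl_coef: float = 0.01,  # hypothetical default, chosen for illustration only
) -> List[float]:
    """Turn per-token KL estimates into per-token rewards (illustrative sketch)."""
    # Each token is penalized in proportion to its KL estimate.
    rewards = [-kl_coef * kl for kl in kl_per_token]
    # Assumption: the scalar reward for the whole response lands on the last token.
    if rewards:
        rewards[-1] += sequence_reward
    return rewards


if __name__ == "__main__":
    print(kl_penalized_rewards([0.5, -0.25, 1.0], sequence_reward=2.0, kl_coef=0.1))
```

With the k1 estimator a token's KL value can be negative, in which case this scheme gives that token a small positive reward rather than a penalty.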

Code Reference

Source Location

Repository: OpenRLHF
File: openrlhf/models/utils.py
Lines: L7-41

Signature

def compute_approx_kl(
    log_probs: torch.Tensor,       # Log-probs from current policy
    log_probs_base: torch.Tensor,  # Log-probs from reference policy
    kl_estimator: str = "k1",      # Estimator: "k1", "k2", or "k3"
) -> torch.Tensor:
    """
    Compute approximate KL divergence between two distributions.

    Returns:
        Tensor: Per-token KL estimates, clamped to [-10, 10]
    """

Import

from openrlhf.models.utils import compute_approx_kl

I/O Contract

Inputs

Name	Type	Required	Description
log_probs	Tensor	Yes	Log-probabilities from current policy (batch, seq)
log_probs_base	Tensor	Yes	Log-probabilities from reference (batch, seq)
kl_estimator	str	No	Estimator type: "k1", "k2", "k3" (default "k1")

Outputs

Name	Type	Description
kl	Tensor	Per-token KL estimates (batch, seq), clamped to [-10, 10]

Usage Examples

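import torch
from openrlhf.models.utils import compute_approx_kl

# Placeholder per-token log-probs (values <= 0), shape (batch_size, seq_len);
# in practice these come from the policy and reference model forward passes.
policy_log_probs = -torch.rand(2, 8)
ref_log_probs = -torch.rand(2, 8)

kl = compute_approx_kl(
    policy_log_probs,
    ref_log_probs,
    kl_estimator="k1",
)
# kl shape: (batch_size, seq_len)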
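The kl_estimator argument in the call above selects among estimators with different bias-variance properties. The following Monte Carlo sketch gives a feel for that difference using only the Python standard library. It does not call OpenRLHF: the two unit-variance Gaussians, the function names, and the sample count are assumptions chosen so that the true KL divergence is known in closed form.

```python
import math
import random
import statistics

# For two unit-variance Gaussians whose means differ by 0.1, KL = 0.1**2 / 2.
TRUE_KL = 0.5 * 0.1 ** 2


def gaussian_log_pdf(x: float, mean: float) -> float:
    """Log-density of a unit-variance Gaussian."""
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2.0 * math.pi)


def compare_estimators(num_samples: int = 20_000, seed: int = 0) -> dict:
    """Return {estimator: (mean, standard deviation)} over sampled points."""
    rng = random.Random(seed)
    values = {"k1": [], "k2": [], "k3": []}
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)  # sample from the "policy" N(0, 1)
        # Log-ratio of policy N(0, 1) to "reference" N(0.1, 1) at x.
        log_ratio = gaussian_log_pdf(x, 0.0) - gaussian_log_pdf(x, 0.1)
        values["k1"].append(log_ratio)
        values["k2"].append(0.5 * log_ratio ** 2)
        values["k3"].append(math.exp(-log_ratio) - 1.0 + log_ratio)
    return {
        name: (statistics.fmean(v), statistics.pstdev(v))
        for name, v in values.items()
    }


if __name__ == "__main__":
    print("true KL:", TRUE_KL)
    for name, (mean, std) in compare_estimators().items():
        print(f"{name}: mean={mean:.5f} std={std:.5f}")
```

In this setup the k1 values spread much more widely around the true KL than the k3 values do, which is the kind of trade-off the estimator choice controls.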

Related Pages

Implements Principle

Principle:OpenRLHF_OpenRLHF_KL_Divergence_Estimation

Requires Environment

Environment:OpenRLHF_OpenRLHF_Flash_Attention_Environment
