Implementation:OpenRLHF OpenRLHF Compute approx kl

From Leeroopedia


Knowledge Sources
Domains: Reinforcement_Learning, Loss_Functions
Last Updated: 2026-02-07 00:00 GMT

Overview

A concrete tool, provided by OpenRLHF, for approximating KL divergence from sampled log-probabilities.

Description

The compute_approx_kl function computes an approximate KL divergence between two distributions using only their log-probabilities at sampled points. It supports three estimators (k1, k2, k3) with different bias-variance properties, and clamps results to [-10, 10] for numerical stability.
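The three estimator names match those in John Schulman's note on approximating KL divergence, so the description can be illustrated with a minimal sketch under that assumption. It is written in plain Python on a single sampled token and is not the OpenRLHF source; the name approx_kl_sketch, the clamp argument, and the ValueError on unknown names are hypothetical choices made for this example. With samples drawn from the current policy, k1 is unbiased but can be negative for individual samples, k2 is biased but always non-negative, and k3 is both unbiased and non-negative.

```python
import math


def approx_kl_sketch(log_prob, log_prob_base, kl_estimator="k1", clamp=10.0):
    """Illustrative per-sample KL estimate (hypothetical helper, not OpenRLHF code)."""
    # log(pi / pi_ref) evaluated at the sampled token
    log_ratio = log_prob - log_prob_base

    if kl_estimator == "k1":
        # Plain log-ratio: unbiased, may be negative for a single sample
        value = log_ratio
    elif kl_estimator == "k2":
        # Half squared log-ratio: biased, never negative
        value = 0.5 * log_ratio ** 2
    elif kl_estimator == "k3":
        # (r - 1) - log r with r = pi_ref / pi = exp(-log_ratio)
        # The exponent is capped only to avoid float overflow; the result
        # is clamped below anyway.
        value = math.exp(min(-log_ratio, 700.0)) - 1.0 + log_ratio
    else:
        raise ValueError(f"unknown kl_estimator: {kl_estimator!r}")

    # Clamp to [-clamp, clamp], mirroring the [-10, 10] range described above
    return max(-clamp, min(clamp, value))


for name in ("k1", "k2", "k3"):
    print(name, approx_kl_sketch(-1.0, -1.5, name))
```

A tensor version would apply the same arithmetic element-wise, which is why the output keeps the (batch, seq) shape of the inputs.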

Usage

Called during PPO experience generation to compute the KL penalty between the current policy and the reference model.
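One common way a KL penalty enters PPO-style RLHF is as a per-token negative reward, with the sequence-level reward added at the final token. The sketch below illustrates that pattern only; shape_rewards, kl_coef, and sequence_reward are hypothetical names, and the exact shaping used inside OpenRLHF may differ.

```python
def shape_rewards(kl_per_token, sequence_reward, kl_coef=0.01):
    """Turn per-token KL estimates into per-token rewards (hypothetical helper)."""
    # Each token is penalized in proportion to its KL estimate
    rewards = [-kl_coef * kl for kl in kl_per_token]
    # The scalar reward for the whole response is credited to the last token
    if rewards:
        rewards[-1] += sequence_reward
    return rewards


# Two response tokens with KL estimates 0.5 and 1.0, and a sequence reward of 2.0
print(shape_rewards([0.5, 1.0], sequence_reward=2.0, kl_coef=0.1))
```

A larger kl_coef keeps the policy closer to the reference model at the cost of slower reward improvement.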

Code Reference

Source Location

  • Repository: OpenRLHF
  • File: openrlhf/models/utils.py
  • Lines: L7-41

Signature

def compute_approx_kl(
    log_probs: torch.Tensor,       # Log-probs from current policy
    log_probs_base: torch.Tensor,  # Log-probs from reference policy
    kl_estimator: str = "k1",      # Estimator: "k1", "k2", or "k3"
) -> torch.Tensor:
    """
    Compute approximate KL divergence between two distributions.

    Returns:
        Tensor: Per-token KL estimates, clamped to [-10, 10]
    """

Import

from openrlhf.models.utils import compute_approx_kl

I/O Contract

Inputs

Name | Type | Required | Description
log_probs | Tensor | Yes | Log-probabilities from current policy (batch, seq)
log_probs_base | Tensor | Yes | Log-probabilities from reference (batch, seq)
kl_estimator | str | No | Estimator type: "k1", "k2", "k3" (default "k1")

Outputs

Name | Type | Description
kl | Tensor | Per-token KL estimates (batch, seq), clamped to [-10, 10]

Usage Examples

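import torch
from openrlhf.models.utils import compute_approx_kl

# Illustrative per-token log-probs of the sampled tokens, shape (batch_size, seq_len)
policy_log_probs = torch.tensor([[-1.0, -0.5, -2.0]])
ref_log_probs = torch.tensor([[-1.2, -0.7, -1.5]])

kl = compute_approx_kl(
    policy_log_probs,
    ref_log_probs,
    kl_estimator="k1",
)
# kl shape: (batch_size, seq_len)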
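Because the result is per-token, callers typically reduce it before logging or reporting, skipping padded positions. The following is a minimal plain-Python sketch of such a masked mean for one sequence; masked_mean_kl and the 0/1 mask convention are assumptions for illustration, not part of the documented API.

```python
def masked_mean_kl(kl_per_token, mask):
    """Average per-token KL over positions where mask is 1 (hypothetical helper)."""
    if len(kl_per_token) != len(mask):
        raise ValueError("kl_per_token and mask must have the same length")
    # Keep only the positions that belong to the response
    kept = [kl for kl, m in zip(kl_per_token, mask) if m]
    # An all-padding sequence contributes zero rather than dividing by zero
    return sum(kept) / len(kept) if kept else 0.0


kl_row = [0.2, 0.4, 0.0]   # last position is padding
mask_row = [1, 1, 0]
print(masked_mean_kl(kl_row, mask_row))
```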

Related Pages

Implements Principle

Requires Environment
