Heuristic: Alibaba ROLL KL Coefficient Tuning
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, LLMs, Optimization |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
Adaptive KL penalty coefficient starting at 0.2 with proportional error clipped to 20%, using a horizon of 10,000 steps for smooth adjustment.
Description
ROLL implements an adaptive KL controller (following Ziegler et al., 2019) that dynamically adjusts the KL penalty coefficient during training. Each update multiplies the current coefficient by `1 + proportional_error * n_steps / horizon`, where the proportional error is `clip(current_kl / target_kl - 1, -0.2, 0.2)`. The ±20% clip on the proportional error prevents aggressive coefficient swings, and the horizon of 10,000 steps spreads the adjustment out gradually. When `target_kl` is not set, a fixed KL coefficient is used instead. Several KL penalty modes are available: standard KL, absolute difference, MSE, k3 (clamped), and full KL divergence.
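The penalty modes differ only in how the per-token divergence is estimated from policy and reference log-probabilities. A minimal sketch of plausible estimators — the function name and exact formulas here are assumptions, not ROLL's code; `k3` follows Schulman's low-variance estimator:

```python
import numpy as np

def kl_penalty(logp, logp_ref, mode="kl"):
    """Per-token KL penalty estimates from policy/reference log-probs.

    Illustrative formulas only; ROLL's exact definitions may differ.
    """
    diff = logp - logp_ref
    if mode == "kl":    # standard estimator: log-prob difference
        return diff
    if mode == "abs":   # absolute difference
        return np.abs(diff)
    if mode == "mse":   # half squared difference, as in common impls
        return 0.5 * diff ** 2
    if mode == "k3":    # Schulman's k3: exp(-diff) - 1 + diff, always >= 0
        return np.expm1(-diff) + diff
    raise ValueError(f"unknown mode: {mode}")
```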
Usage
Use adaptive KL control when training with PPO/GRPO and observing policy drift (model diverging too far from the reference model). Start with `init_kl_coef=0.2`. If the model is not exploring enough, reduce the initial coefficient or set a higher `target_kl`. If the model is diverging too quickly, increase the initial coefficient or set a lower `target_kl`.
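In practice this tuning reduces to a few fields. A hypothetical starting point — the field names mirror `roll/configs/base_config.py`, but the surrounding dict structure is illustrative only:

```python
# Hypothetical config fragment; only the key names and defaults come
# from ROLL, the dict wrapper is for illustration.
kl_config = {
    "init_kl_coef": 0.2,   # starting penalty strength
    "kl_horizon": 10000,   # smoothing time constant, in steps
    "target_kl": None,     # None -> fixed coefficient; set e.g. 0.1 for adaptive
}
```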
The Insight (Rule of Thumb)
- Action: Set `init_kl_coef=0.2` and `kl_horizon=10000`. Optionally set `target_kl` for adaptive control.
- Value: `init_kl_coef=0.2`, `kl_horizon=10000`, proportional error clip ±0.2.
- Trade-off: Higher KL coefficient constrains policy changes (safer but slower learning); lower coefficient allows more freedom (faster but risks divergence from reference).
- Fixed vs Adaptive: Use fixed (`target_kl=None`) for stable training; use adaptive when you want the observed KL steered toward a specific `target_kl` as training conditions shift.
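The fixed alternative is trivial. A sketch of the usual pattern — the `FixedKLController` name follows Ziegler et al.'s reference code and is assumed, not confirmed, for ROLL:

```python
class FixedKLController:
    """Constant KL coefficient; update() is a deliberate no-op."""

    def __init__(self, kl_coef):
        self.value = kl_coef

    def update(self, current, n_steps):
        pass  # ignore observed KL; the coefficient never changes

ctl = FixedKLController(kl_coef=0.2)
ctl.update(current=0.5, n_steps=256)  # observed KL has no effect
```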
Reasoning
The KL penalty prevents the policy from deviating too far from the reference model, which would lead to reward hacking or mode collapse. The adaptive controller from Ziegler et al. 2019 automatically adjusts the penalty strength: if current KL exceeds the target, the coefficient increases; if KL is below target, it decreases. The 20% clip prevents the coefficient from changing too rapidly in a single update, which could cause training oscillation. The 10,000-step horizon provides a long time constant for smooth adjustment.
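To see why the clip and horizon give smooth behavior: with the error clipped at ±0.2, the per-update multiplier is bounded by `1 ± 0.2 * n_steps / horizon`. A quick check (the batch size of 256 is an arbitrary illustration):

```python
clip, horizon, n_steps = 0.2, 10000, 256

# Worst-case single-update change in the KL coefficient:
max_mult = 1 + clip * n_steps / horizon  # at most about +0.5% per update
min_mult = 1 - clip * n_steps / horizon  # at most about -0.5% per update
print(max_mult, min_mult)
```

Even under sustained worst-case error, the coefficient drifts by only ~0.5% per 256-sample update, which is what makes the 10,000-step horizon a long time constant.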
Code from `roll/utils/kl_controller.py:6-21`:

```python
import numpy as np


class AdaptiveKLController:
    """Adaptive KL penalty controller (Ziegler et al., 2019)."""

    def __init__(self, init_kl_coef, target, horizon):
        self.value = init_kl_coef  # current KL coefficient
        self.target = target       # target KL divergence
        self.horizon = horizon     # steps over which adjustment is spread

    def update(self, current, n_steps):
        # Proportional error, clipped to +/-20% to limit coefficient swings.
        proportional_error = np.clip(current / self.target - 1, -0.2, 0.2)
        mult = 1 + proportional_error * n_steps / self.horizon
        self.value *= mult
```
Configuration defaults from `roll/configs/base_config.py:450-453`:

```python
init_kl_coef: float = field(default=0.2)
kl_horizon: int = field(default=10000)
```
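A self-contained simulation (the controller class restated so the snippet runs on its own) shows the coefficient ramping up while observed KL sits above target; the loop values are illustrative, not from ROLL:

```python
import numpy as np

class AdaptiveKLController:
    def __init__(self, init_kl_coef, target, horizon):
        self.value = init_kl_coef
        self.target = target
        self.horizon = horizon

    def update(self, current, n_steps):
        proportional_error = np.clip(current / self.target - 1, -0.2, 0.2)
        self.value *= 1 + proportional_error * n_steps / self.horizon

ctl = AdaptiveKLController(init_kl_coef=0.2, target=0.1, horizon=10000)
for _ in range(50):                  # 50 updates with KL stuck at 3x target
    ctl.update(current=0.3, n_steps=256)
print(round(ctl.value, 4))           # coefficient has grown by roughly 29%
```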