Heuristic: Alibaba ROLL KL Coefficient Tuning
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, LLMs, Optimization |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
Adaptive KL penalty coefficient starting at 0.2 with proportional error clipped to 20%, using a horizon of 10,000 steps for smooth adjustment.
Description
ROLL implements an adaptive KL controller (following Ziegler et al., 2019) that dynamically adjusts the KL penalty coefficient during training. Each update multiplies the current coefficient by `1 + proportional_error * n_steps / horizon`, where the proportional error is `clip(current_kl / target_kl - 1, -0.2, 0.2)`. The ±20% clip on the proportional error prevents aggressive coefficient swings, and the horizon of 10,000 steps spreads the adjustment out gradually. When `target_kl` is not set, a fixed KL coefficient is used instead. Several KL penalty modes are available: standard KL, absolute difference, MSE, k3 (clamped), and full KL divergence.
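The penalty modes differ only in how the per-token divergence is estimated from policy and reference log-probabilities. A minimal sketch of plausible estimators — the function name and exact formulas here are assumptions, not ROLL's code; `k3` follows Schulman's low-variance estimator:

```python
import numpy as np

def kl_penalty(logp, logp_ref, mode="kl"):
    """Per-token KL penalty estimates from policy/reference log-probs.

    Illustrative formulas only; ROLL's exact definitions may differ.
    """
    diff = logp - logp_ref
    if mode == "kl":    # standard estimator: log-prob difference
        return diff
    if mode == "abs":   # absolute difference
        return np.abs(diff)
    if mode == "mse":   # half squared difference, as in common impls
        return 0.5 * diff ** 2
    if mode == "k3":    # Schulman's k3: exp(-diff) - 1 + diff, always >= 0
        return np.expm1(-diff) + diff
    raise ValueError(f"unknown mode: {mode}")
```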
Usage
Use adaptive KL control when training with PPO/GRPO and observing policy drift (model diverging too far from the reference model). Start with `init_kl_coef=0.2`. If the model is not exploring enough, reduce the initial coefficient or set a higher `target_kl`. If the model is diverging too quickly, increase the initial coefficient or set a lower `target_kl`.
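In practice this tuning reduces to a few fields. A hypothetical starting point — the field names mirror `roll/configs/base_config.py`, but the surrounding dict structure is illustrative only:

```python
# Hypothetical config fragment; only the key names and defaults come
# from ROLL, the dict wrapper is for illustration.
kl_config = {
    "init_kl_coef": 0.2,   # starting penalty strength
    "kl_horizon": 10000,   # smoothing time constant, in steps
    "target_kl": None,     # None -> fixed coefficient; set e.g. 0.1 for adaptive
}
```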
The Insight (Rule of Thumb)
- Action: Set `init_kl_coef=0.2` and `kl_horizon=10000`. Optionally set `target_kl` for adaptive control.
- Value: `init_kl_coef=0.2`, `kl_horizon=10000`, proportional error clip ±0.2.
- Trade-off: Higher KL coefficient constrains policy changes (safer but slower learning); lower coefficient allows more freedom (faster but risks divergence from reference).
- Fixed vs Adaptive: Use fixed (`target_kl=None`) for stable training; use adaptive when you want the observed KL steered toward a specific `target_kl` as training conditions shift.
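The fixed alternative is trivial. A sketch of the usual pattern — the `FixedKLController` name follows Ziegler et al.'s reference code and is assumed, not confirmed, for ROLL:

```python
class FixedKLController:
    """Constant KL coefficient; update() is a deliberate no-op."""

    def __init__(self, kl_coef):
        self.value = kl_coef

    def update(self, current, n_steps):
        pass  # ignore observed KL; the coefficient never changes

ctl = FixedKLController(kl_coef=0.2)
ctl.update(current=0.5, n_steps=256)  # observed KL has no effect
```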
Reasoning
The KL penalty prevents the policy from deviating too far from the reference model, which would lead to reward hacking or mode collapse. The adaptive controller from Ziegler et al. 2019 automatically adjusts the penalty strength: if current KL exceeds the target, the coefficient increases; if KL is below target, it decreases. The 20% clip prevents the coefficient from changing too rapidly in a single update, which could cause training oscillation. The 10,000-step horizon provides a long time constant for smooth adjustment.
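To see why the clip and horizon give smooth behavior: with the error clipped at ±0.2, the per-update multiplier is bounded by `1 ± 0.2 * n_steps / horizon`. A quick check (the batch size of 256 is an arbitrary illustration):

```python
clip, horizon, n_steps = 0.2, 10000, 256

# Worst-case single-update change in the KL coefficient:
max_mult = 1 + clip * n_steps / horizon  # at most about +0.5% per update
min_mult = 1 - clip * n_steps / horizon  # at most about -0.5% per update
print(max_mult, min_mult)
```

Even under sustained worst-case error, the coefficient drifts by only ~0.5% per 256-sample update, which is what makes the 10,000-step horizon a long time constant.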
Code from `roll/utils/kl_controller.py:6-21`:

```python
import numpy as np


class AdaptiveKLController:
    """Adaptive KL penalty controller (Ziegler et al., 2019)."""

    def __init__(self, init_kl_coef, target, horizon):
        self.value = init_kl_coef  # current KL coefficient
        self.target = target       # target KL divergence
        self.horizon = horizon     # steps over which adjustment is spread

    def update(self, current, n_steps):
        # Proportional error, clipped to +/-20% to limit coefficient swings.
        proportional_error = np.clip(current / self.target - 1, -0.2, 0.2)
        mult = 1 + proportional_error * n_steps / self.horizon
        self.value *= mult
```
Configuration defaults from `roll/configs/base_config.py:450-453`:

```python
init_kl_coef: float = field(default=0.2)
kl_horizon: int = field(default=10000)
```
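A self-contained simulation (the controller class restated so the snippet runs on its own) shows the coefficient ramping up while observed KL sits above target; the loop values are illustrative, not from ROLL:

```python
import numpy as np

class AdaptiveKLController:
    def __init__(self, init_kl_coef, target, horizon):
        self.value = init_kl_coef
        self.target = target
        self.horizon = horizon

    def update(self, current, n_steps):
        proportional_error = np.clip(current / self.target - 1, -0.2, 0.2)
        self.value *= 1 + proportional_error * n_steps / self.horizon

ctl = AdaptiveKLController(init_kl_coef=0.2, target=0.1, horizon=10000)
for _ in range(50):                  # 50 updates with KL stuck at 3x target
    ctl.update(current=0.3, n_steps=256)
print(round(ctl.value, 4))           # coefficient has grown by roughly 29%
```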