Heuristic: ContextualAI HALOs Humanline Clamping
| Knowledge Sources | |
|---|---|
| Domains | LLM_Alignment, Optimization, Prospect_Theory |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
Prospect-theoretic per-token log-ratio clamping that bounds the policy-reference divergence to improve alignment stability and prevent reward hacking.
Description
Humanline alignment clamps the per-token log-ratio between the policy and reference model, inspired by prospect theory from behavioral economics. Instead of allowing the per-token quantity `log p_policy(token) - log p_reference(token)` to grow without bound, it is clamped to the interval `[log_epsilon_P, log_epsilon_R]`. This prevents the policy from deviating too far from the reference in either direction at the token level. The name "humanline" reflects its grounding in prospect-theoretic loss functions that model human decision-making biases.
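The clamping operation itself can be sketched in a few lines. This is a minimal illustration, not code from the repository; the log-prob values are made up, and only the bounds match the defaults in `config/config.yaml`:

```python
import torch

# Hypothetical per-token log-probs for a 4-token sequence (illustrative values).
policy_logps = torch.tensor([-0.5, -3.0, -0.1, -2.0])
reference_logps = torch.tensor([-1.0, -1.0, -2.5, -2.0])

log_epsilon_P, log_epsilon_R = -1.0, 1.5  # defaults from config.yaml

raw = policy_logps - reference_logps          # unbounded per-token log-ratios
clamped = raw.clamp(log_epsilon_P, log_epsilon_R)

print([round(x, 2) for x in raw.tolist()])      # [0.5, -2.0, 2.4, 0.0]
print([round(x, 2) for x in clamped.tolist()])  # [0.5, -1.0, 1.5, 0.0]
```

Tokens whose log-ratio already lies inside the interval pass through unchanged; the two outliers are pinned to the nearest bound.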
Usage
Use this heuristic when you want to constrain per-token divergence between the policy and reference model during alignment training. It is particularly useful when training with DPO, KTO, GRPO, or PPO and you observe reward hacking (the model exploiting the reward signal without genuinely improving). Enable it by setting `humanline=true` in the config, and tune the clamping bounds `log_epsilon_P` (lower bound, default -1.0) and `log_epsilon_R` (upper bound, default 1.5). When humanline is active, the reference model is synced with the policy after each step (`sync_reference_with_policy`).
The Insight (Rule of Thumb)
- Action: Set `++humanline=true ++log_epsilon_P=-1.0 ++log_epsilon_R=1.5` in the launch command.
- Value: Default bounds are `log_epsilon_P=-1.0` (lower) and `log_epsilon_R=1.5` (upper). These can be tuned per experiment.
- Trade-off: Tighter bounds prevent reward hacking but may slow convergence. The reference model must be synced with the policy each step, adding communication overhead.
- Monitoring: Track the `unclamped` metric reported in training logs. Values near 1.0 mean most tokens are within bounds; values near 0.0 mean aggressive clamping is occurring.
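The `unclamped` statistic is the fraction of non-padding tokens whose log-ratio falls strictly inside the bounds. A sketch of that computation, using the same masking idea as the trainer (zeros treated as padding; the log-ratio values here are illustrative, not from any real run):

```python
import torch

# Illustrative detached log-ratios for an 8-token sequence; the zeros stand in
# for padding positions, which are excluded from the statistic.
detached = torch.tensor([0.2, -1.7, 0.9, 2.1, 0.0, 0.0, -0.4, 1.6])

log_epsilon_P, log_epsilon_R = -1.0, 1.5

within = (log_epsilon_P < detached) & (detached < log_epsilon_R)
nonpad = detached != 0
# clamp(min=1) guards against division by zero on an all-padding batch.
unclamped = (within & nonpad).float().sum() / nonpad.float().sum().clamp(min=1)

print(unclamped.item())  # 0.5 -> half the non-padding tokens are within bounds
```

A value drifting toward 0.0 over training means the policy keeps pushing tokens against the bounds, which is worth investigating before loosening `log_epsilon_P`/`log_epsilon_R`.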
Reasoning
Standard alignment methods (DPO, KTO) compute a log-ratio between policy and reference at each token. Without bounds, the policy can exploit individual tokens by assigning extreme probabilities (very high for preferred tokens, very low for dispreferred ones), leading to reward hacking. Prospect theory suggests humans are loss-averse and have diminishing sensitivity to gains/losses beyond reference points. Humanline implements this by clamping the token-level log-ratio, creating a "humanlike" sensitivity function. The `sync_reference` behavior when humanline is active ensures the reference tracks the policy, making the clamping bounds relative to the current policy rather than a fixed starting point.
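Concretely, because the exponentiated log-ratio is what enters the loss, the default bounds confine each token's probability ratio to a fixed band no matter how extreme the policy becomes. A quick check of the numbers (bounds are the config defaults; the "extreme" log-ratio is a made-up example):

```python
import math

log_epsilon_P, log_epsilon_R = -1.0, 1.5  # defaults from config.yaml

# The per-token ratio exp(logratio) is confined to [e^-1, e^1.5].
lo, hi = math.exp(log_epsilon_P), math.exp(log_epsilon_R)
print(round(lo, 3), round(hi, 3))  # 0.368 4.482

# A hypothetical extreme token (logratio +8, a ~3000x probability ratio)
# contributes at most hi once clamped:
extreme = min(max(8.0, log_epsilon_P), log_epsilon_R)
print(math.exp(extreme) == hi)  # True
```

So a single exploited token can move the loss by at most a factor of about 4.5 upward or 0.37 downward, which is what blunts the token-level reward-hacking strategy described above.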
Code Evidence
Token-level clamping in `train/trainers.py:148-161`:
```python
def get_ratios(self, policy_logps, reference_logps):
    if self.config.humanline:
        logratio = (policy_logps - reference_logps).clamp(
            self.config.log_epsilon_P, self.config.log_epsilon_R)
        detached = (policy_logps - reference_logps).detach()
        unclamped = (self.config.log_epsilon_P < detached) & (detached < self.config.log_epsilon_R)
        unclamped = ((unclamped & (detached != 0)).float().sum() /
                     (detached != 0).float().sum().clamp(min=1)).clamp(max=1, min=0)
    else:
        logratio = policy_logps - reference_logps
        unclamped = torch.Tensor([1]).to(self.policy_dtype).to(self.accelerator.device)
    ratio = logratio.exp()
    return ratio, unclamped
```
Reference sync triggered by humanline in `train/trainers.py:364`:
```python
if self.config.loss.sync_reference or self.config.humanline:
    self.sync_reference_with_policy()
```
Default config values in `config/config.yaml:94-97`:
```yaml
humanline: false
log_epsilon_P: -1.0
log_epsilon_R: 1.5
```
Launch script usage in `scripts/launch_llama_instruct_dpo_humanline.sh:76`:
```bash
++humanline=true ++log_epsilon_P=${L} ++log_epsilon_R=${U}
```