
Heuristic:ContextualAI HALOs Humanline Clamping

From Leeroopedia



Knowledge Sources
Domains LLM_Alignment, Optimization, Prospect_Theory
Last Updated 2026-02-08 03:00 GMT

Overview

Prospect-theoretic per-token log-ratio clamping that bounds the policy-reference divergence to improve alignment stability and prevent reward hacking.

Description

Humanline alignment applies clamping to the per-token log-ratio between the policy and reference model, inspired by prospect theory from behavioral economics. Instead of allowing the divergence `log p_policy(token) - log p_reference(token)` to grow without bound, the log-ratio is clamped to the interval `[log_epsilon_P, log_epsilon_R]`. This prevents the policy from drifting too far from the reference in either direction at the token level. The name "humanline" reflects its grounding in prospect-theoretic loss functions that model human decision-making biases.
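The clamping step itself is small enough to sketch in a few lines of PyTorch. The standalone function below is illustrative (the name `clamped_log_ratio` is ours; the real implementation lives inside the trainer, as quoted under Code Evidence):

```python
import torch

def clamped_log_ratio(policy_logps: torch.Tensor,
                      reference_logps: torch.Tensor,
                      log_epsilon_P: float = -1.0,
                      log_epsilon_R: float = 1.5) -> torch.Tensor:
    """Clamp each token's log-ratio to [log_epsilon_P, log_epsilon_R]."""
    return (policy_logps - reference_logps).clamp(log_epsilon_P, log_epsilon_R)

# Raw per-token log-ratios of 0.0, +3.0, -4.0 become 0.0, +1.5, -1.0
# under the default bounds:
policy = torch.tensor([0.0, 2.0, -4.0])
reference = torch.tensor([0.0, -1.0, 0.0])
print(clamped_log_ratio(policy, reference))  # values: 0.0, 1.5, -1.0
```

Tokens whose raw log-ratio already lies inside the bounds pass through unchanged; only the extremes are truncated.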

Usage

Use this heuristic when you want to constrain per-token divergence between the policy and reference model during alignment training. It is particularly useful when training with DPO, KTO, GRPO, or PPO and you observe reward hacking (the model exploiting the reward signal without genuinely improving). Enable it by setting `humanline=true` in the config, and tune the clamping bounds `log_epsilon_P` (lower bound, default -1.0) and `log_epsilon_R` (upper bound, default 1.5). When humanline is active, the reference model is synced with the policy after each step (`sync_reference_with_policy`).

The Insight (Rule of Thumb)

  • Action: Set `++humanline=true ++log_epsilon_P=-1.0 ++log_epsilon_R=1.5` in the launch command.
  • Value: Default bounds are `log_epsilon_P=-1.0` (lower) and `log_epsilon_R=1.5` (upper). These can be tuned per experiment.
  • Trade-off: Tighter bounds prevent reward hacking but may slow convergence. The reference model must be synced with the policy each step, adding communication overhead.
  • Monitoring: Track the `unclamped` metric reported in training logs. Values near 1.0 mean most tokens are within bounds; values near 0.0 mean aggressive clamping is occurring.
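The `unclamped` metric can be reproduced outside the trainer for sanity-checking. The sketch below is an illustrative reimplementation (the function name `unclamped_fraction` is ours); it treats zero log-ratios as padding, matching the masking convention in the quoted trainer code:

```python
import torch

def unclamped_fraction(policy_logps: torch.Tensor,
                       reference_logps: torch.Tensor,
                       log_epsilon_P: float = -1.0,
                       log_epsilon_R: float = 1.5) -> torch.Tensor:
    """Fraction of non-padding tokens whose raw log-ratio lies strictly
    inside the clamping bounds (mirrors the `unclamped` training metric)."""
    detached = (policy_logps - reference_logps).detach()
    inside = (log_epsilon_P < detached) & (detached < log_epsilon_R)
    nonpad = detached != 0  # zero log-ratios are treated as padding
    return (inside & nonpad).float().sum() / nonpad.float().sum().clamp(min=1)

# Three real tokens (0.5, 2.0, -1.2) plus one padding token (0.0);
# only 0.5 falls inside (-1.0, 1.5), so the metric reports 1/3:
logps = torch.tensor([0.5, 2.0, -1.2, 0.0])
print(unclamped_fraction(logps, torch.zeros(4)))
```

A steadily falling value of this metric is an early signal that the policy is piling probability mass onto individual tokens, i.e. the reward-hacking pattern the clamp is designed to suppress.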

Reasoning

Standard alignment methods (DPO, KTO) compute a log-ratio between policy and reference at each token. Without bounds, the policy can exploit individual tokens by assigning extreme probabilities (very high for preferred tokens, very low for dispreferred ones), leading to reward hacking. Prospect theory suggests humans are loss-averse and have diminishing sensitivity to gains/losses beyond reference points. Humanline implements this by clamping the token-level log-ratio, creating a "humanlike" sensitivity function. The `sync_reference` behavior when humanline is active ensures the reference tracks the policy, making the clamping bounds relative to the current policy rather than a fixed starting point.
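The anti-reward-hacking effect is visible directly in the gradients: `clamp` is flat outside its bounds, so a token already past a bound receives zero gradient and cannot be pushed further. A minimal sketch (toy values, not from the source):

```python
import torch

# Two tokens: one inside the bounds (0.5), one past the upper bound (3.0).
policy_logps = torch.tensor([0.5, 3.0], requires_grad=True)
reference_logps = torch.tensor([0.0, 0.0])

logratio = (policy_logps - reference_logps).clamp(-1.0, 1.5)
logratio.sum().backward()

# The in-bounds token gets a full gradient; the saturated token gets none,
# so the optimizer stops rewarding further exaggeration of that token.
print(policy_logps.grad)  # values: 1.0, 0.0
```

This is the mechanical expression of "diminishing sensitivity beyond reference points": past the bound, additional divergence earns nothing.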

Code Evidence

Token-level clamping in `train/trainers.py:148-161`:

def get_ratios(self, policy_logps, reference_logps):
    if self.config.humanline:
        logratio = (policy_logps - reference_logps).clamp(
            self.config.log_epsilon_P, self.config.log_epsilon_R)
        detached = (policy_logps - reference_logps).detach()
        unclamped = (self.config.log_epsilon_P < detached) & (detached < self.config.log_epsilon_R)
        unclamped = ((unclamped & (detached != 0)).float().sum() /
                     (detached != 0).float().sum().clamp(min=1)).clamp(max=1, min=0)
    else:
        logratio = policy_logps - reference_logps
        unclamped = torch.Tensor([1]).to(self.policy_dtype).to(self.accelerator.device)
    ratio = logratio.exp()
    return ratio, unclamped
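One consequence worth noting: since `ratio = logratio.exp()`, the default bounds confine the per-token probability ratio `p_policy / p_reference` to a fixed interval, roughly `[0.37, 4.48]`:

```python
import math

# Exponentiating the default clamping bounds gives the allowed range
# of the per-token probability ratio p_policy / p_reference.
lower, upper = math.exp(-1.0), math.exp(1.5)
print(round(lower, 3), round(upper, 3))  # 0.368 4.482
```

So with the defaults, no single token can become more than about 4.5x likelier, or less than about a third as likely, relative to the reference.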

Reference sync triggered by humanline in `train/trainers.py:364`:

if self.config.loss.sync_reference or self.config.humanline:
    self.sync_reference_with_policy()

Default config values in `config/config.yaml:94-97`:

humanline: false
log_epsilon_P: -1.0
log_epsilon_R: 1.5

Launch script usage in `scripts/launch_llama_instruct_dpo_humanline.sh:76`:

++humanline=true ++log_epsilon_P=${L} ++log_epsilon_R=${U}
