Heuristic:LLMBook zh LLMBook zh github io DPO Beta Hyperparameter

From Leeroopedia




Knowledge Sources
Domains: LLMs, Alignment, RLHF
Last Updated: 2026-02-08 04:30 GMT

Overview

Set the DPO `beta` hyperparameter to 0.1 as a starting point. Beta sets the strength of the implicit KL-divergence penalty that keeps the policy model close to the reference model during preference alignment.

Description

The `beta` hyperparameter in Direct Preference Optimization (DPO) controls how far the trained policy may diverge from the frozen reference model. A lower beta permits more divergence, which can fit the preference data more closely but raises the risk of over-optimization (reward hacking). A higher beta keeps the policy closer to the reference, which is safer but may leave the model less aligned with the preferences. The codebase uses beta=0.1 as its default, the same value the original DPO paper uses by default in its experiments.
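For context, DPO is derived from the KL-regularized reward-maximization objective used in RLHF, where beta appears as the weight on the KL term. The form below is the standard statement of that objective. The symbols `r` (reward), `\mathcal{D}` (prompt distribution), `\pi_\theta` (policy), and `\pi_{\mathrm{ref}}` (reference model) are notation introduced here and are not taken from the codebase.

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\bigl[ r(x, y) \bigr]
\;-\;
\beta \, \mathbb{D}_{\mathrm{KL}}\!\bigl[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \bigr]
```

Read this way, a larger beta makes each unit of divergence from the reference more costly, which is why higher values keep the policy more conservative.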

Usage

Use this heuristic when configuring DPO alignment training. Start with beta=0.1. Increase beta (for example, to 0.3-0.5) if you observe reward hacking or degraded output quality. Decrease beta (for example, to 0.01-0.05) if the model is not picking up the preference patterns.

The Insight (Rule of Thumb)

  • Action: Set `beta=0.1` in DPO training configuration.
  • Value: 0.1 (the codebase default, matching the default used in the original DPO paper).
  • Trade-off: Higher beta = less divergence from reference (more conservative). Lower beta = more divergence (more aggressive alignment, risk of collapse).
  • Companion Setting: DPO uses shorter context windows (`model_max_length=512`) compared to pre-training/SFT (2048).
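The two settings in the list above can be grouped in one configuration object. The following is a minimal sketch using only the standard library. The class name `DPOArguments` and the validation in `__post_init__` are assumptions made for illustration; the codebase itself declares these fields with `HfArg`.

```python
from dataclasses import dataclass, field


@dataclass
class DPOArguments:
    """Hypothetical container for the two DPO settings discussed above."""

    # Weight of the implicit KL penalty; higher keeps the policy closer
    # to the reference model.
    beta: float = field(
        default=0.1,
        metadata={"help": "The beta factor in DPO loss."},
    )
    # Maximum sequence length (in tokens) for DPO training examples.
    model_max_length: int = field(
        default=512,
        metadata={"help": "Maximum sequence length."},
    )

    def __post_init__(self) -> None:
        # Assumption: beta is a positive scale factor, so reject values <= 0.
        if self.beta <= 0:
            raise ValueError(f"beta must be positive, got {self.beta}")
        if self.model_max_length <= 0:
            raise ValueError(
                f"model_max_length must be positive, got {self.model_max_length}"
            )


default_args = DPOArguments()               # beta=0.1, model_max_length=512
conservative_args = DPOArguments(beta=0.3)  # stays closer to the reference
```

Validating the values at construction time is a design choice of this sketch, so that a mistyped beta fails before training starts.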

Reasoning

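For a single preference pair with prompt `x`, chosen response `y_w`, and rejected response `y_l`, the DPO loss is: `L_DPO = -log σ(β · [log(π_θ(y_w|x) / π_ref(y_w|x)) − log(π_θ(y_l|x) / π_ref(y_l|x))])`, where `σ` is the logistic sigmoid, `π_θ` is the policy being trained, and `π_ref` is the frozen reference model. Beta scales the difference between the log-probability ratios of the chosen and rejected responses. At beta=0.1, the model is gently pushed toward preferred responses without departing dramatically from the reference policy.

The 512-token context limit (vs. 2048 for pre-training and SFT) likely reflects that preference data (human-chosen vs. rejected responses) typically consists of shorter conversational exchanges.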
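The loss described above can be sketched for one preference pair in plain Python. This is an illustrative sketch, not the codebase's implementation: the function names and the example log-probabilities are assumptions, and each input is taken to be the summed log-probability of a full response under the policy or the reference model.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # log(sigmoid(z)) = min(z, 0) - log(1 + exp(-|z|))
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for a single (chosen, rejected) pair of responses."""
    # Log-ratio of policy to reference for each response.
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    # Beta scales the margin between the two log-ratios.
    margin = chosen_log_ratio - rejected_log_ratio
    return -log_sigmoid(beta * margin)


# Hypothetical log-probabilities: the policy has raised the chosen response
# by 1 nat and lowered the rejected response by 1 nat relative to the reference.
example = dict(
    policy_chosen_logp=-9.0,
    policy_rejected_logp=-13.0,
    ref_chosen_logp=-10.0,
    ref_rejected_logp=-12.0,
)
loss_low_beta = dpo_loss(**example, beta=0.1)
loss_high_beta = dpo_loss(**example, beta=0.5)

# When the policy equals the reference, the margin is 0 and the loss is log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0, beta=0.1)
```

With the same margin, the lower beta leaves the loss closer to its starting value of log 2 (about 0.693), so the policy has to move further from the reference to reduce the loss by the same amount. This is one way to read the statement that lower beta allows more divergence.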

Code Evidence:

DPO beta configuration from `code/8.2 DPO实践.py:28-32`:

# The beta hyperparameter used in DPO
beta: float = HfArg(
    default=0.1,
    help="The beta factor in DPO loss."
    "Higher beta means less divergence from the initial policy.",
)

Shorter context window for DPO from `code/8.2 DPO实践.py:21`:

model_max_length: int = HfArg(default=512, help="Maximum sequence length.")

Reference model frozen from `code/8.2 DPO实践.py:62-64`:

model_ref.eval()
for param in model_ref.parameters():
    param.requires_grad = False
