Heuristic: Hugging Face Alignment Handbook DPO Beta Selection

From Leeroopedia




Knowledge Sources
Domains LLMs, Optimization
Last Updated 2026-02-07 00:00 GMT

Overview

The DPO `beta` parameter controls how far the policy may deviate from the reference model: the alignment-handbook uses 0.01 for standard DPO and 0.05 for the ORPO and APO-Zero variants.

Description

The beta parameter in DPO controls how strongly the policy is constrained to stay close to the reference model. A smaller beta allows the model to deviate more from the reference, learning stronger preferences. The alignment-handbook uses beta=0.01 for standard DPO training (Zephyr-7B) and beta=0.05 for ORPO (Zephyr-141B) and APO-Zero (SmolLM3). The higher beta in ORPO/APO-Zero compensates for the lack of a separate reference model.
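This role of beta is visible directly in the DPO loss. The sketch below is plain Python for illustration, not the alignment-handbook's or TRL's implementation; it shows how beta scales the implicit reward (the log-probability ratio against the reference model) before the Bradley-Terry sigmoid is applied:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta):
    """DPO loss for a single preference pair (toy sketch).

    beta scales the implicit rewards (log-ratios vs. the reference model).
    A smaller beta flattens the sigmoid, so the policy must drift further
    from the reference to achieve the same preference margin.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): standard Bradley-Terry preference loss
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Same log-probabilities, different beta: with beta=0.01 the loss stays
# close to log(2) (~0.693), i.e. a weaker per-example penalty for a given
# divergence from the reference than with beta=0.05.
print(dpo_loss(-10.0, -30.0, -12.0, -25.0, beta=0.01))
print(dpo_loss(-10.0, -30.0, -12.0, -25.0, beta=0.05))
```

The hypothetical log-probability values above are illustrative only; the point is that the same policy/reference gap produces a smaller margin, and hence weaker loss saturation, under a lower beta.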

Usage

Apply this when configuring DPO, ORPO, or APO-Zero training. Choose beta based on the training method and how aggressively you want preference alignment.

The Insight (Rule of Thumb)

  • Action: Set `beta` based on the preference alignment method.
  • Value:
    • Standard DPO: `beta: 0.01` (aggressive preference learning with reference model)
    • ORPO: `beta: 0.05` (moderate, no reference model)
    • APO-Zero: `beta: 0.05` (moderate, anchored preference optimization)
  • Trade-off: Lower beta = stronger preference alignment but more risk of reward hacking; higher beta = more conservative but potentially less impactful alignment.
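The rule of thumb above could be encoded as a small lookup helper. The function and table below are a hypothetical convenience, not part of the alignment-handbook; only the beta values are taken from the recipes cited on this page:

```python
# Defaults mirroring the alignment-handbook recipes cited on this page.
ALIGNMENT_HANDBOOK_BETA = {
    "dpo": 0.01,       # standard DPO with a frozen reference model
    "orpo": 0.05,      # reference-free, so stay more conservative
    "apo_zero": 0.05,  # anchored preference optimization
}

def default_beta(method: str) -> float:
    """Return the handbook-style default beta for a preference method."""
    try:
        return ALIGNMENT_HANDBOOK_BETA[method.lower()]
    except KeyError:
        raise ValueError(f"no default beta recorded for method {method!r}")
```

Usage: `default_beta("DPO")` returns `0.01`, while `default_beta("orpo")` and `default_beta("apo_zero")` return `0.05`.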

Reasoning

DPO's beta controls the KL divergence penalty between the policy and reference model. With a reference model present (standard DPO), a lower beta (0.01) is safe because the reference model provides a strong anchor. In ORPO and APO-Zero, there is no separate reference model, so a higher beta (0.05) helps prevent the model from diverging too far from the initial policy.
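The KL-penalty view has a closed-form illustration: the optimum of the KL-regularized RLHF objective satisfies pi*(y) proportional to pi_ref(y) * exp(r(y) / beta). The toy two-response example below (a numeric sketch, not handbook code) shows that a small beta lets the optimal policy move almost arbitrarily far from the reference, while a larger beta keeps it anchored:

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """Closed-form optimum of the KL-regularized objective:
    pi*(y) is proportional to ref(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

ref = [0.5, 0.5]       # uniform reference over two responses
rewards = [1.0, 0.0]   # the first response is preferred

# Low beta: the optimum is near-deterministic, far from the reference.
print(optimal_policy(ref, rewards, beta=0.01))
# High beta: the optimum stays close to the uniform reference.
print(optimal_policy(ref, rewards, beta=1.0))
```

With beta=0.01 nearly all mass moves to the preferred response; with beta=1.0 the policy only shifts to roughly a 73/27 split, staying near the reference distribution.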

Standard DPO config from `recipes/zephyr-7b-beta/dpo/config_full.yaml:28`:

beta: 0.01

ORPO config from `recipes/zephyr-141b-A35b/orpo/config_full.yaml:15`:

beta: 0.05

APO-Zero config from `recipes/smollm3/dpo/apo.yaml:51-52`:

beta: 0.05
loss_type: apo_zero
