Heuristic: Hugging Face Alignment Handbook DPO Beta Selection
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Optimization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
The DPO beta parameter controls how far the policy may deviate from the reference model: the alignment-handbook recipes use 0.01 for standard DPO and 0.05 for the ORPO and APO-Zero variants.
Description
The beta parameter in DPO controls how strongly the policy is constrained to stay close to the reference model. A smaller beta allows the model to deviate more from the reference, learning stronger preferences. The alignment-handbook uses beta=0.01 for standard DPO training (Zephyr-7B) and beta=0.05 for ORPO (Zephyr-141B) and APO-Zero (SmolLM3). The higher beta in ORPO/APO-Zero compensates for the lack of a separate reference model.
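The role of beta can be sketched with a minimal per-example version of the sigmoid DPO loss (plain Python; the function and variable names are illustrative, not from the handbook). At the same policy-vs-reference margin, a smaller beta leaves the loss further from zero, so training keeps pushing the margin wider, i.e. the policy deviates more from the reference.

```python
import math

def dpo_loss(policy_logratio: float, ref_logratio: float, beta: float) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * (policy_logratio - ref_logratio))).

    policy_logratio: log p(chosen) - log p(rejected) under the policy
    ref_logratio:    the same quantity under the frozen reference model
    beta scales how quickly the loss saturates as the margin grows.
    """
    logits = beta * (policy_logratio - ref_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Same margin improvement (+2 nats over the reference), different beta:
# the smaller beta leaves a larger residual loss, so optimization keeps
# widening the margin -- stronger preference learning, more deviation.
print(dpo_loss(2.0, 0.0, beta=0.01))  # ~0.683
print(dpo_loss(2.0, 0.0, beta=1.0))   # ~0.127
```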
Usage
Apply this when configuring DPO, ORPO, or APO-Zero training. Choose beta based on the training method and how aggressively you want preference alignment.
The Insight (Rule of Thumb)
- Action: Set `beta` based on the preference alignment method.
- Value:
- Standard DPO: `beta: 0.01` (aggressive preference learning with reference model)
- ORPO: `beta: 0.05` (moderate, no reference model)
- APO-Zero: `beta: 0.05` (moderate, anchored preference optimization)
- Trade-off: Lower beta = stronger preference alignment but more risk of reward hacking; higher beta = more conservative but potentially less impactful alignment.
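The rule of thumb above can be captured in a small lookup helper. This is a hypothetical sketch (`RECIPE_SETTINGS` and `recommended_beta` are names invented here); the values are the ones from the handbook recipes quoted in this card.

```python
# Hypothetical mapping from alignment method to the beta (and, for APO-Zero,
# the loss_type) used in the alignment-handbook recipes.
RECIPE_SETTINGS = {
    "dpo": {"beta": 0.01},                                # Zephyr-7B
    "orpo": {"beta": 0.05},                               # Zephyr-141B
    "apo_zero": {"beta": 0.05, "loss_type": "apo_zero"},  # SmolLM3
}

def recommended_beta(method: str) -> float:
    """Return the handbook recipe's beta for a preference-alignment method."""
    return RECIPE_SETTINGS[method]["beta"]

print(recommended_beta("dpo"))       # 0.01
print(recommended_beta("apo_zero"))  # 0.05
```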
Reasoning
DPO's beta controls the KL divergence penalty between the policy and reference model. With a reference model present (standard DPO), a lower beta (0.01) is safe because the reference model provides a strong anchor. In ORPO and APO-Zero, there is no separate reference model, so a higher beta (0.05) helps prevent the model from diverging too far from the initial policy.
Standard DPO config from `recipes/zephyr-7b-beta/dpo/config_full.yaml:28`:

```yaml
beta: 0.01
```

ORPO config from `recipes/zephyr-141b-A35b/orpo/config_full.yaml:15`:

```yaml
beta: 0.05
```

APO-Zero config from `recipes/smollm3/dpo/apo.yaml:51-52`:

```yaml
beta: 0.05
loss_type: apo_zero
```