Heuristic: Hugging Face Alignment Handbook DPO Beta Selection
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Optimization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
The DPO beta parameter controls how far the policy may deviate from the reference model: the alignment-handbook recipes use 0.01 for standard DPO and 0.05 for the ORPO and APO-Zero variants.
Description
The beta parameter in DPO controls how strongly the policy is constrained to stay close to the reference model. A smaller beta allows the model to deviate more from the reference, learning stronger preferences. The alignment-handbook uses beta=0.01 for standard DPO training (Zephyr-7B) and beta=0.05 for ORPO (Zephyr-141B) and APO-Zero (SmolLM3). The higher beta in ORPO/APO-Zero compensates for the lack of a separate reference model.
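The role of beta can be sketched with a minimal per-example version of the sigmoid DPO loss (plain Python; the function and variable names are illustrative, not from the handbook). At the same policy-vs-reference margin, a smaller beta leaves the loss further from zero, so training keeps pushing the margin wider, i.e. the policy deviates more from the reference.

```python
import math

def dpo_loss(policy_logratio: float, ref_logratio: float, beta: float) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * (policy_logratio - ref_logratio))).

    policy_logratio: log p(chosen) - log p(rejected) under the policy
    ref_logratio:    the same quantity under the frozen reference model
    beta scales how quickly the loss saturates as the margin grows.
    """
    logits = beta * (policy_logratio - ref_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Same margin improvement (+2 nats over the reference), different beta:
# the smaller beta leaves a larger residual loss, so optimization keeps
# widening the margin -- stronger preference learning, more deviation.
print(dpo_loss(2.0, 0.0, beta=0.01))  # ~0.683
print(dpo_loss(2.0, 0.0, beta=1.0))   # ~0.127
```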
Usage
Apply this when configuring DPO, ORPO, or APO-Zero training. Choose beta based on the training method and how aggressively you want preference alignment.
The Insight (Rule of Thumb)
- Action: Set `beta` based on the preference alignment method.
- Value:
- Standard DPO: `beta: 0.01` (aggressive preference learning with reference model)
- ORPO: `beta: 0.05` (moderate, no reference model)
- APO-Zero: `beta: 0.05` (moderate, anchored preference optimization)
- Trade-off: Lower beta = stronger preference alignment but more risk of reward hacking; higher beta = more conservative but potentially less impactful alignment.
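The rule of thumb above can be captured in a small lookup helper. This is a hypothetical sketch (`RECIPE_SETTINGS` and `recommended_beta` are names invented here); the values are the ones from the handbook recipes quoted in this card.

```python
# Hypothetical mapping from alignment method to the beta (and, for APO-Zero,
# the loss_type) used in the alignment-handbook recipes.
RECIPE_SETTINGS = {
    "dpo": {"beta": 0.01},                                # Zephyr-7B
    "orpo": {"beta": 0.05},                               # Zephyr-141B
    "apo_zero": {"beta": 0.05, "loss_type": "apo_zero"},  # SmolLM3
}

def recommended_beta(method: str) -> float:
    """Return the handbook recipe's beta for a preference-alignment method."""
    return RECIPE_SETTINGS[method]["beta"]

print(recommended_beta("dpo"))       # 0.01
print(recommended_beta("apo_zero"))  # 0.05
```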
Reasoning
DPO's beta controls the KL divergence penalty between the policy and reference model. With a reference model present (standard DPO), a lower beta (0.01) is safe because the reference model provides a strong anchor. In ORPO and APO-Zero, there is no separate reference model, so a higher beta (0.05) helps prevent the model from diverging too far from the initial policy.
Standard DPO config from `recipes/zephyr-7b-beta/dpo/config_full.yaml:28`:

```yaml
beta: 0.01
```

ORPO config from `recipes/zephyr-141b-A35b/orpo/config_full.yaml:15`:

```yaml
beta: 0.05
```

APO-Zero config from `recipes/smollm3/dpo/apo.yaml:51-52`:

```yaml
beta: 0.05
loss_type: apo_zero
```