Heuristic:Princeton nlp SimPO Hyperparameter Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs, Preference_Optimization |
| Last Updated | 2026-02-08 05:00 GMT |
Overview
Hyperparameter selection guide for SimPO training: learning_rate, beta, and gamma_beta_ratio with recommended search ranges and per-model settings.
Description
SimPO has three critical hyperparameters that require careful tuning: learning_rate, beta, and gamma_beta_ratio (gamma / beta). The learning rate is the most sensitive of the three: values that are too large (e.g., 1e-5) cause catastrophic forgetting and produce incoherent or repetitive outputs. Beta scales the reward difference and must be much larger in SimPO than in DPO (2.0 to 10.0 for SimPO, versus values such as 0.01 for DPO). The gamma_beta_ratio sets the target reward margin relative to beta and yields a modest improvement when well tuned. Keep the total batch size fixed at 128, where total batch size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus.
Usage
Use this heuristic when configuring SimPO training for a new model or dataset. It applies whenever you select hyperparameters for the SimPOTrainer or create a new training YAML config. Also consult it when you observe training degradation (incoherent outputs, format forgetting, repetitive text).
The Insight (Rule of Thumb)
- Action: Grid search learning_rate over {3e-7, 5e-7, 8e-7, 1e-6}. Use smaller values (e.g., 5e-7) for reasoning-intensive domains like math.
- Action: Set beta much larger than DPO. Start with 2.0-2.5, but consider up to 10 for instruct models.
- Action: Set gamma_beta_ratio starting at 0.5, grid search between 0 and 1. This is less critical than learning_rate and beta.
- Action: Fix total batch size at 128 (per_device_train_batch_size * gradient_accumulation_steps * num_gpus).
- Trade-off: A learning rate that is too large degrades output quality (incoherence, repetition). A large beta increases reward scaling sensitivity. SFT regularization (sft_weight > 0) preserves reasoning but degrades chat performance.
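The search described in the actions above can be sketched as a small grid builder. This is only an illustration: the helper names (`accumulation_steps`, `build_grid`), the evenly spaced gamma_beta_ratio grid, and the choice of beta candidates are assumptions made here, not part of the SimPO codebase.

```python
from itertools import product

# Search values taken from the rule of thumb above.
LEARNING_RATES = [3e-7, 5e-7, 8e-7, 1e-6]
BETAS = [2.0, 2.5, 10.0]
# Assumption: an evenly spaced grid over the suggested [0, 1] range.
GAMMA_BETA_RATIOS = [0.0, 0.25, 0.5, 0.75, 1.0]

TOTAL_BATCH_SIZE = 128


def accumulation_steps(per_device_train_batch_size, num_gpus):
    """Return the gradient_accumulation_steps that gives a total batch size of 128."""
    per_step = per_device_train_batch_size * num_gpus
    if per_step <= 0 or TOTAL_BATCH_SIZE % per_step != 0:
        raise ValueError(
            f"{per_device_train_batch_size} x {num_gpus} does not divide {TOTAL_BATCH_SIZE}"
        )
    return TOTAL_BATCH_SIZE // per_step


def build_grid(per_device_train_batch_size, num_gpus):
    """Enumerate candidate configs; each one keeps the total batch size fixed."""
    steps = accumulation_steps(per_device_train_batch_size, num_gpus)
    return [
        {
            "learning_rate": lr,
            "beta": beta,
            "gamma_beta_ratio": ratio,
            "per_device_train_batch_size": per_device_train_batch_size,
            "gradient_accumulation_steps": steps,
        }
        for lr, beta, ratio in product(LEARNING_RATES, BETAS, GAMMA_BETA_RATIOS)
    ]


if __name__ == "__main__":
    grid = build_grid(per_device_train_batch_size=2, num_gpus=8)
    print(len(grid), "configs; first:", grid[0])
```

A full Cartesian grid is expensive. Because learning_rate and beta matter most, one reasonable way to shrink it is to sweep those two first and tune gamma_beta_ratio afterwards.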
Recommended hyperparameters per model:
| Model | beta | gamma_beta_ratio | learning_rate |
|---|---|---|---|
| Mistral-Base | 2.0 | 0.8 | 3e-7 |
| Mistral-Instruct | 2.5 | 0.1 | 5e-7 |
| Llama3-Base | 2.0 | 0.5 | 6e-7 |
| Llama3-Instruct | 2.5 | 0.55 | 1e-6 |
| Llama3-Instruct v0.2 | 10 | 0.3 | 1e-6 |
| Gemma-2-9B-IT | 10 | 0.5 | 8e-7 |
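For scripting, the table above can be kept as a lookup, with the absolute margin gamma recovered as beta * gamma_beta_ratio. The dictionary layout and the `implied_gamma` helper are hypothetical conveniences; only the numbers come from the table.

```python
# Values copied from the table above; the structure itself is an assumption.
RECOMMENDED = {
    "Mistral-Base": {"beta": 2.0, "gamma_beta_ratio": 0.8, "learning_rate": 3e-7},
    "Mistral-Instruct": {"beta": 2.5, "gamma_beta_ratio": 0.1, "learning_rate": 5e-7},
    "Llama3-Base": {"beta": 2.0, "gamma_beta_ratio": 0.5, "learning_rate": 6e-7},
    "Llama3-Instruct": {"beta": 2.5, "gamma_beta_ratio": 0.55, "learning_rate": 1e-6},
    "Llama3-Instruct v0.2": {"beta": 10.0, "gamma_beta_ratio": 0.3, "learning_rate": 1e-6},
    "Gemma-2-9B-IT": {"beta": 10.0, "gamma_beta_ratio": 0.5, "learning_rate": 8e-7},
}


def implied_gamma(model):
    """Recover gamma from the ratio: gamma = beta * gamma_beta_ratio."""
    cfg = RECOMMENDED[model]
    return cfg["beta"] * cfg["gamma_beta_ratio"]


if __name__ == "__main__":
    for name in RECOMMENDED:
        print(f"{name}: gamma = {implied_gamma(name):.3f}")
```

These settings are starting points for the listed models. For a new model or dataset, the grid search ranges still apply.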
Reasoning
SimPO's loss function is reference-free and uses length-normalized log probabilities, which changes the scale of the reward signal compared to DPO. Because per-token average log probabilities differ by much less than summed log-probability ratios, larger beta values are needed to maintain sufficient separation between chosen and rejected responses. Learning rate sensitivity arises because preference optimization can drastically alter model behavior with only small weight changes. The authors found that Gemma-2 models exhibit significantly less catastrophic forgetting than Llama-3 on math tasks, so model choice affects robustness to the learning rate. Using ArmoRM (a strong reward model) for dataset annotation improved results substantially in the v0.2 experiments, suggesting that data quality matters as much as hyperparameters.
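To make the role of beta and gamma_beta_ratio concrete, here is a minimal per-example sketch of the SimPO objective in plain Python. It assumes the standard formulation, loss = -log sigmoid(beta * (avg_logp_chosen - avg_logp_rejected - gamma_beta_ratio)). The function and argument names are illustrative and are not the SimPOTrainer implementation.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def simpo_loss(chosen_logp_sum, chosen_len, rejected_logp_sum, rejected_len,
               beta=2.0, gamma_beta_ratio=0.5):
    """Per-example SimPO loss from summed token log probabilities and lengths."""
    # Length normalization: average log probability per token, no reference model.
    avg_chosen = chosen_logp_sum / chosen_len
    avg_rejected = rejected_logp_sum / rejected_len
    # The target margin is expressed relative to beta: gamma = beta * gamma_beta_ratio.
    logit = beta * (avg_chosen - avg_rejected - gamma_beta_ratio)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logit)


if __name__ == "__main__":
    # Chosen averages -1.0 per token, rejected averages -3.0 per token.
    for beta in (0.01, 2.0, 10.0):
        print(beta, simpo_loss(-10.0, 10, -30.0, 10, beta=beta))
```

With a DPO-sized beta such as 0.01, the logit stays near zero and the loss stays close to log 2 even when the chosen response is clearly preferred. This is one way to see why SimPO needs a much larger beta.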
Evidence from README:
- "A large learning rate (e.g., 1e-5) can significantly degrade performance, causing the model to produce incoherent sentences or completely repetitive responses."
- "SimPO requires a much larger beta than DPO."
- "A well-tuned gamma_beta_ratio can provide a modest improvement, but it is not as critical as other hyperparameters."