Workflow:Princeton nlp SimPO SimPO Training

Knowledge Sources	SimPO; SimPO: Simple Preference Optimization with a Reference-Free Reward; HuggingFace Alignment Handbook
Domains	LLMs, Preference_Optimization, Fine_Tuning
Last Updated	2026-02-08 04:00 GMT

Overview

End-to-end process for fine-tuning large language models using SimPO (Simple Preference Optimization), a reference-free preference optimization algorithm that uses length-normalized average log probabilities as implicit rewards.

Description

This workflow implements the SimPO training pipeline for aligning language models with human preferences. SimPO eliminates the need for a reference model (unlike DPO) by using the average log probability of a sequence as an implicit reward signal, combined with a target reward margin to encourage separation between preferred and dispreferred responses. The pipeline covers environment configuration via YAML files, preference dataset loading with proportional mixing, chat template application for prompt/chosen/rejected formatting, distributed model training with DeepSpeed ZeRO-3 or FSDP, and model saving with optional Hub upload.

Key features of SimPO:

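Reference-free: no second model is held in memory during training, roughly halving the memory spent on model weights relative to DPO (optimizer state and activations are unaffected)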
Length-normalized rewards: prevents bias toward longer responses
Three critical hyperparameters: learning_rate, beta (reward scaling), and gamma_beta_ratio (target margin)
Supports Llama-3, Mistral, and Gemma model families
Config-driven: all settings specified via YAML files

Usage

Execute this workflow when you have a preference dataset in OpenAI message format (with chosen/rejected response pairs) and want to align a base or instruction-tuned language model with human preferences. This is the primary workflow for reproducing the SimPO paper results or applying SimPO to custom tasks. It requires multi-GPU hardware (the setup is designed for 4xH100) with DeepSpeed ZeRO-3 or FSDP for distributed training.

Execution Steps

Step 1: Environment Setup

Prepare the Python environment with all required dependencies. This includes installing PyTorch, the HuggingFace alignment-handbook package (which provides the base training infrastructure), and Flash Attention 2 for efficient attention computation. The environment specification is captured in a Conda environment file for reproducibility.

Key considerations:

Use Python 3.10 with Conda
Install the PyTorch v2.2.2 build that matches your CUDA version
Install alignment-handbook from source (provides core utilities)
Flash Attention 2 is required for efficient training
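
Before launching a run, the considerations above can be checked with a short script. This is a minimal sketch, not part of the workflow itself: the function name check_environment is hypothetical, and the import names probed (alignment for alignment-handbook, flash_attn for Flash Attention 2) are assumptions about how those packages are exposed once installed.

```python
import importlib.util
import sys


def check_environment(required=("torch", "alignment", "flash_attn")):
    """Report the Python version and which required modules are importable.

    The module names are assumptions: `alignment` is taken to be the import
    name of alignment-handbook and `flash_attn` that of Flash Attention 2.
    """
    report = {"python": f"{sys.version_info.major}.{sys.version_info.minor}"}
    for name in required:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A False entry means the corresponding package still needs to be installed in the active Conda environment.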

Step 2: Configuration

Select and customize a YAML training configuration file that specifies all model, data, and training hyperparameters. The configuration system uses H4ArgumentParser to merge YAML settings with optional command-line overrides. Each configuration targets a specific model family and variant (base vs instruct).

What is configured:

Model path (e.g., meta-llama/Meta-Llama-3-8B-Instruct)
Dataset mixer with proportions (e.g., princeton-nlp/llama3-ultrafeedback: 1.0)
SimPO-specific hyperparameters: beta, gamma_beta_ratio, sft_weight
Training parameters: learning_rate, per_device_train_batch_size, gradient_accumulation_steps, max_length
Distributed training backend: DeepSpeed ZeRO-3 or FSDP
Attention implementation: flash_attention_2
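
The idea of merging file-based settings with command-line overrides can be illustrated with the standard library alone. The sketch below does not reproduce H4ArgumentParser; SimPOSettings and apply_overrides are hypothetical names, a plain dict stands in for the parsed YAML file, and every numeric value is a placeholder rather than a published recipe.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class SimPOSettings:
    # Placeholder defaults for illustration only; not the published recipes.
    model_name_or_path: str = "meta-llama/Meta-Llama-3-8B-Instruct"
    beta: float = 2.0
    gamma_beta_ratio: float = 0.5
    sft_weight: float = 0.0
    learning_rate: float = 1e-6
    max_length: int = 2048


def apply_overrides(settings, overrides):
    """Return a copy of `settings` updated from `--key=value` strings."""
    types = {f.name: f.type for f in fields(settings)}
    updates = {}
    for item in overrides:
        key, _, raw = item.lstrip("-").partition("=")
        if key not in types:
            raise KeyError(f"unknown setting: {key}")
        updates[key] = types[key](raw)  # cast the string to the field's type
    return replace(settings, **updates)


# A dict stands in for the parsed YAML file (values are placeholders).
yaml_like = {"beta": 2.5, "gamma_beta_ratio": 0.55}
base = SimPOSettings(**yaml_like)
final = apply_overrides(base, ["--learning_rate=5e-7"])
```

Here the file supplies beta and gamma_beta_ratio, the override replaces learning_rate, and the remaining fields keep their defaults.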

Step 3: Dataset Loading and Chat Template Application

Load the preference dataset using the dataset mixer (which supports proportional sampling from multiple sources), then apply model-specific chat templates to format each example into prompt/chosen/rejected triplets. The chat template application handles BOS token stripping from responses, system message insertion, and model-specific formatting (e.g., Mistral instruction template).
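
Proportional mixing can be sketched as follows. Plain Python lists stand in for HuggingFace datasets, mix_datasets is a hypothetical name, and taking the leading fraction of each source before shuffling is an assumption of this sketch rather than a confirmed detail of the mixer.

```python
import random


def mix_datasets(sources, mixer, seed=42):
    """Combine a fraction of each source, then shuffle the result.

    `sources` maps a dataset name to a list of examples and `mixer` maps the
    same names to a fraction in [0, 1]. Plain lists stand in for HuggingFace
    datasets here.
    """
    mixed = []
    for name, frac in mixer.items():
        if not 0.0 <= frac <= 1.0:
            raise ValueError(f"fraction for {name} must be in [0, 1]")
        rows = sources[name]
        mixed.extend(rows[: int(frac * len(rows))])  # keep the leading slice
    random.Random(seed).shuffle(mixed)  # fixed seed keeps the mix reproducible
    return mixed
```

With a single source at proportion 1.0, as in the example configuration entry, the mixer simply returns the whole training split in shuffled order.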

What happens:

Dataset loaded via HuggingFace datasets with configurable splits (train/test)
Each example validated for OpenAI message format (role/content dicts)
Prompt extracted from conversation history (all turns except the last)
Chosen and rejected responses formatted with the model's chat template
BOS tokens stripped from response beginnings to prevent double-BOS issues
Columns renamed to match TRL trainer expectations (text_prompt → prompt, etc.)
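
The formatting steps above can be sketched without a real tokenizer. In this illustration toy_template is a made-up stand-in for the model's chat template, the "<s>" BOS marker is an assumption, and the helper names are hypothetical; a real run would rely on the tokenizer's own template and BOS token.

```python
BOS = "<s>"  # hypothetical BOS marker emitted by the toy template below


def toy_template(messages):
    """Stand-in for a tokenizer chat template; not a real model format."""
    return BOS + "".join(f"[{m['role']}] {m['content']}\n" for m in messages)


def is_openai_format(messages):
    """Check for a non-empty list of dicts carrying 'role' and 'content'."""
    return (
        isinstance(messages, list)
        and len(messages) > 0
        and all(isinstance(m, dict) and {"role", "content"} <= m.keys() for m in messages)
    )


def strip_bos(text):
    """Drop a leading BOS so it is not duplicated when prompt and response are joined."""
    return text[len(BOS):] if text.startswith(BOS) else text


def to_triplet(example):
    chosen, rejected = example["chosen"], example["rejected"]
    if not (is_openai_format(chosen) and is_openai_format(rejected)):
        raise ValueError("expected lists of {'role', 'content'} dicts")
    return {
        "prompt": toy_template(chosen[:-1]),            # all turns except the last
        "chosen": strip_bos(toy_template(chosen[-1:])),
        "rejected": strip_bos(toy_template(rejected[-1:])),
    }
```

The prompt keeps its BOS token while both responses lose theirs, so concatenating prompt and response yields exactly one BOS.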

Step 4: Model and Tokenizer Initialization

Load the tokenizer with appropriate settings (left truncation, pad token configuration, chat template assignment) and prepare the model loading kwargs including dtype, quantization config (optional 4-bit/8-bit), attention implementation, and gradient checkpointing settings. The model is passed as a string path to the SimPOTrainer, which handles lazy loading.

Key considerations:

Tokenizer truncation side set to "left" to preserve response labels
Pad token defaults to EOS token if not set
Model max length capped at 2048 if the tokenizer reports an unreasonably large value
Quantization optional via BitsAndBytesConfig (4-bit NF4 or 8-bit)
LoRA/PEFT configuration available for parameter-efficient training
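
The tokenizer settings listed above amount to three attribute updates. The sketch below applies them to any tokenizer-like object; configure_tokenizer is a hypothetical name, and the 100,000-token threshold for an "unreasonably large" maximum length is an assumption. A SimpleNamespace stands in for a real HuggingFace tokenizer so the example runs without downloading a model.

```python
from types import SimpleNamespace

MAX_REASONABLE_LENGTH = 100_000  # assumed threshold for "unreasonably large"
FALLBACK_MAX_LENGTH = 2048


def configure_tokenizer(tokenizer, truncation_side="left"):
    """Apply truncation side, pad token fallback, and max length cap in place."""
    # Truncating from the left keeps the response tokens at the end of the sequence.
    tokenizer.truncation_side = truncation_side
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    if tokenizer.model_max_length > MAX_REASONABLE_LENGTH:
        tokenizer.model_max_length = FALLBACK_MAX_LENGTH
    return tokenizer


# Stand-in object; a real run would pass a HuggingFace tokenizer instead.
tok = SimpleNamespace(pad_token_id=None, eos_token_id=2,
                      model_max_length=10**30, truncation_side="right")
configure_tokenizer(tok)
```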

Step 5: SimPO Training

Instantiate the SimPOTrainer (which extends HuggingFace Trainer) with the model, tokenizer, datasets, and configuration. The trainer handles the core SimPO loss computation: it calculates length-normalized average log probabilities for chosen and rejected responses, then applies a sigmoid (or hinge) loss on the margin between them, offset by the gamma target margin. Training uses distributed execution via accelerate with the configured backend.

Core algorithm:

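For each response in the batch, compute the length-normalized log probability over its tokens: avg_logp = sum(logp) / length
SimPO loss = -log(sigmoid(beta * (avg_logp_chosen - avg_logp_rejected) - gamma))
Where gamma = beta * gamma_beta_ratio, so the sigmoid argument equals beta * (avg_logp_chosen - avg_logp_rejected - gamma_beta_ratio)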
Optional SFT regularization loss on chosen responses (controlled by sft_weight)
Dropout disabled during training for stability
Gradient checkpointing used to reduce memory footprint
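
The loss for a single preference pair can be worked through in plain Python. This is a sketch for checking the arithmetic, not the trainer's batched implementation: the function names are hypothetical, per-token log probabilities are passed in as lists, and the target margin enters as gamma_beta_ratio inside the beta scaling, which is the same as subtracting gamma = beta * gamma_beta_ratio after scaling. The hinge form shown is an assumption about how that variant is defined.

```python
import math


def average_logp(token_logps):
    """Length-normalized log probability of one response."""
    return sum(token_logps) / len(token_logps)


def simpo_loss(chosen_logps, rejected_logps, beta, gamma_beta_ratio, loss_type="sigmoid"):
    """SimPO loss for one (chosen, rejected) pair of per-token log probabilities."""
    # Margin between the length-normalized rewards, offset by the target ratio.
    margin = average_logp(chosen_logps) - average_logp(rejected_logps) - gamma_beta_ratio
    z = beta * margin  # equals beta * (reward difference) - gamma
    if loss_type == "sigmoid":
        # -log(sigmoid(z)) evaluated as softplus(-z) for numerical stability.
        return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    if loss_type == "hinge":
        return max(0.0, 1.0 - z)
    raise ValueError(f"unknown loss_type: {loss_type}")


# Averages are -0.5 and -2.0, so the margin is 1.5 - 0.5 = 1.0 and z = 2.0.
loss = simpo_loss([-0.5, -0.5], [-1.0, -2.0, -3.0], beta=2.0, gamma_beta_ratio=0.5)
```

Because each response is averaged over its own length, the three-token rejected response is not penalized merely for being longer than the two-token chosen one.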

Step 6: Model Saving and Evaluation

After training completes, save the trained model weights, training metrics, and trainer state to the output directory. Optionally run evaluation on the test split to compute eval loss and accuracy metrics. A model card is generated with dataset and training provenance. The model can be optionally pushed to the HuggingFace Hub.

What is saved:

Full model weights (or adapter weights if using PEFT)
Training metrics (loss, learning rate, samples processed)
Evaluation metrics (eval loss, eval accuracy) if do_eval is enabled
Model card with finetuning provenance and dataset tags
Model config with use_cache re-enabled for inference
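
The saving sequence can be sketched against the standard HuggingFace Trainer methods (log_metrics, save_metrics, save_state, save_model, create_model_card, evaluate, push_to_hub). The function name finalize_run and its arguments are hypothetical, and the sketch omits the main-process guard a distributed run would place around model card creation and config saving.

```python
def finalize_run(trainer, training_args, num_train_samples, train_metrics, card_kwargs=None):
    """Save metrics, state, and weights after training; optionally evaluate and upload.

    `trainer` is assumed to expose the HuggingFace Trainer methods used here.
    """
    card_kwargs = card_kwargs or {}
    metrics = dict(train_metrics, train_samples=num_train_samples)
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()
    trainer.save_model(training_args.output_dir)

    # In a distributed run these three calls belong on the main process only.
    trainer.create_model_card(**card_kwargs)
    trainer.model.config.use_cache = True  # re-enable the KV cache for inference
    trainer.model.config.save_pretrained(training_args.output_dir)

    if training_args.do_eval:
        eval_metrics = trainer.evaluate()
        trainer.log_metrics("eval", eval_metrics)
        trainer.save_metrics("eval", eval_metrics)
    if training_args.push_to_hub:
        trainer.push_to_hub(**card_kwargs)
```

Evaluation and Hub upload stay behind their flags, so a run with do_eval and push_to_hub disabled only writes to the output directory.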
