Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Princeton nlp SimPO SimPO Training

From Leeroopedia


Knowledge Sources
Domains LLMs, Preference_Optimization, Fine_Tuning
Last Updated 2026-02-08 04:00 GMT

Overview

End-to-end process for fine-tuning large language models using SimPO (Simple Preference Optimization), a reference-free preference optimization algorithm that uses length-normalized average log probabilities as implicit rewards.

Description

This workflow implements the SimPO training pipeline for aligning language models with human preferences. SimPO eliminates the need for a reference model (unlike DPO) by using the average log probability of a sequence as an implicit reward signal, combined with a target reward margin to encourage separation between preferred and dispreferred responses. The pipeline covers environment configuration via YAML files, preference dataset loading with proportional mixing, chat template application for prompt/chosen/rejected formatting, distributed model training with DeepSpeed ZeRO-3 or FSDP, and model saving with optional Hub upload.

Key features of SimPO:

  • Reference-free: no second model needed during training, reducing memory by ~50%
  • Length-normalized rewards: prevents bias toward longer responses
  • Three critical hyperparameters: learning_rate, beta (reward scaling), and gamma_beta_ratio (target margin)
  • Supports Llama-3, Mistral, and Gemma model families
  • Config-driven: all settings specified via YAML files

Usage

Execute this workflow when you have a preference dataset in OpenAI message format (with chosen/rejected response pairs) and want to align a base or instruction-tuned language model to follow human preferences. This is the primary workflow for reproducing the SimPO paper results or applying SimPO to custom tasks. Requires multi-GPU hardware (designed for 4xH100) with DeepSpeed ZeRO-3 or FSDP for distributed training.

Execution Steps

Step 1: Environment Setup

Prepare the Python environment with all required dependencies. This includes installing PyTorch, the HuggingFace alignment-handbook package (which provides the base training infrastructure), and Flash Attention 2 for efficient attention computation. The environment specification is captured in a Conda environment file for reproducibility.

Key considerations:

  • Use Python 3.10 with Conda
  • Install PyTorch v2.2.2 matching your CUDA version
  • Install alignment-handbook from source (provides core utilities)
  • Flash Attention 2 is required for efficient training

Step 2: Configuration

Select and customize a YAML training configuration file that specifies all model, data, and training hyperparameters. The configuration system uses H4ArgumentParser to merge YAML settings with optional command-line overrides. Each configuration targets a specific model family and variant (base vs instruct).

What is configured:

  • Model path (e.g., meta-llama/Meta-Llama-3-8B-Instruct)
  • Dataset mixer with proportions (e.g., princeton-nlp/llama3-ultrafeedback: 1.0)
  • SimPO-specific hyperparameters: beta, gamma_beta_ratio, sft_weight
  • Training parameters: learning_rate, batch_size, gradient_accumulation, max_length
  • Distributed training backend: DeepSpeed ZeRO-3 or FSDP
  • Attention implementation: flash_attention_2

Step 3: Dataset Loading and Chat Template Application

Load the preference dataset using the dataset mixer (which supports proportional sampling from multiple sources), then apply model-specific chat templates to format each example into prompt/chosen/rejected triplets. The chat template application handles BOS token stripping from responses, system message insertion, and model-specific formatting (e.g., Mistral instruction template).

What happens:

  • Dataset loaded via HuggingFace datasets with configurable splits (train/test)
  • Each example validated for OpenAI message format (role/content dicts)
  • Prompt extracted from conversation history (all turns except the last)
  • Chosen and rejected responses formatted with the model's chat template
  • BOS tokens stripped from response beginnings to prevent double-BOS issues
  • Columns renamed to match TRL trainer expectations (text_prompt → prompt, etc.)

Step 4: Model and Tokenizer Initialization

Load the tokenizer with appropriate settings (left truncation, pad token configuration, chat template assignment) and prepare the model loading kwargs including dtype, quantization config (optional 4-bit/8-bit), attention implementation, and gradient checkpointing settings. The model is passed as a string path to the SimPOTrainer, which handles lazy loading.

Key considerations:

  • Tokenizer truncation side set to "left" to preserve response labels
  • Pad token defaults to EOS token if not set
  • Model max length capped at 2048 if the tokenizer reports an unreasonably large value
  • Quantization optional via BitsAndBytesConfig (4-bit NF4 or 8-bit)
  • LoRA/PEFT configuration available for parameter-efficient training

Step 5: SimPO Training

Instantiate the SimPOTrainer (which extends HuggingFace Trainer) with the model, tokenizer, datasets, and configuration. The trainer handles the core SimPO loss computation: it calculates length-normalized average log probabilities for chosen and rejected responses, then applies a sigmoid (or hinge) loss on the margin between them, offset by the gamma target margin. Training uses distributed execution via accelerate with the configured backend.

Core algorithm:

  • For each batch, compute average log probabilities: avg_logp = sum(logp) / length
  • SimPO loss = -log(sigmoid(beta * (avg_logp_chosen - avg_logp_rejected - gamma)))
  • Where gamma = beta * gamma_beta_ratio
  • Optional SFT regularization loss on chosen responses (controlled by sft_weight)
  • Dropout disabled during training for stability
  • Gradient checkpointing used to reduce memory footprint

Step 6: Model Saving and Evaluation

After training completes, save the trained model weights, training metrics, and trainer state to the output directory. Optionally run evaluation on the test split to compute eval loss and accuracy metrics. A model card is generated with dataset and training provenance. The model can be optionally pushed to the HuggingFace Hub.

What is saved:

  • Full model weights (or adapter weights if using PEFT)
  • Training metrics (loss, learning rate, samples processed)
  • Evaluation metrics (eval loss, eval accuracy) if do_eval is enabled
  • Model card with finetuning provenance and dataset tags
  • Model config with use_cache re-enabled for inference

Execution Diagram

GitHub URL

Workflow Repository