
Principle:Allenai Open instruct DPO Training

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Natural Language Processing, Distributed Training
Last Updated 2026-02-07 00:00 GMT

Overview

DPO training is the end-to-end workflow for optimizing a language model's policy to align with human preferences using Direct Preference Optimization, without requiring a separately trained reward model.

Description

The DPO training loop implements a preference-based fine-tuning pipeline that takes a pre-trained language model and a dataset of preference pairs (chosen vs. rejected responses), and produces an aligned model. Unlike RLHF, which trains a reward model and then uses reinforcement learning (e.g., PPO) to optimize the policy, DPO directly optimizes the policy using a closed-form loss derived from the preference data.
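For the standard variant, the closed-form objective (as given in the original DPO paper) is the negative log-sigmoid of the difference in scaled policy-vs-reference log-ratios between the chosen and rejected responses:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Here $y_w$ and $y_l$ are the chosen and rejected responses, $\beta$ scales the implicit reward, and $\sigma$ is the logistic function.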

The training workflow consists of the following stages:

1. Initialization:

  • Set up the distributed training environment using HuggingFace Accelerate with optional DeepSpeed ZeRO optimization (stages 0-3).
  • Load the pre-trained model and tokenizer.
  • Configure LoRA adapters if parameter-efficient fine-tuning is desired.
  • Prepare the preference dataset via tokenization and filtering.
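The dataset-preparation step above (tokenize, then filter) can be sketched as follows. This is an illustrative sketch only; the function and field names are assumptions, not open-instruct's actual API.

```python
# Illustrative sketch of stage 1 dataset preparation: tokenize each
# preference pair and drop pairs exceeding the length budget.

def prepare_preference_dataset(examples, tokenize, max_seq_length=512):
    """Tokenize preference pairs and drop ones that exceed the length budget."""
    prepared = []
    for ex in examples:
        chosen_ids = tokenize(ex["prompt"] + ex["chosen"])
        rejected_ids = tokenize(ex["prompt"] + ex["rejected"])
        # Filtering: skip pairs where either side would be truncated.
        if max(len(chosen_ids), len(rejected_ids)) > max_seq_length:
            continue
        prepared.append({"chosen_ids": chosen_ids, "rejected_ids": rejected_ids})
    return prepared

# Toy tokenizer: one "token" per whitespace-separated word.
toy_tokenize = lambda text: text.split()

examples = [
    {"prompt": "Q: 2+2? ", "chosen": "A: 4", "rejected": "A: 5"},
    {"prompt": "Q: " + "very " * 600, "chosen": "A", "rejected": "B"},
]
print(len(prepare_preference_dataset(examples, toy_tokenize)))  # the long pair is filtered out
```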

2. Reference Logprob Caching (for DPO, DPO-Norm, WPO):

  • Before training begins, compute the reference model's log-probabilities for all training examples.
  • Cache these values to disk and GPU memory for reuse throughout training.
  • This step is skipped for SimPO, which does not require a reference model.
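The caching step reduces to summing per-token reference log-probabilities over response tokens only (prompt tokens are masked out) and storing the result per example. A minimal sketch, with illustrative names rather than open-instruct's actual API:

```python
# Sketch of stage 2: sum per-token reference logprobs over response tokens
# (mask = 1) and cache (chosen, rejected) totals keyed by example index.

def sequence_logprob(token_logprobs, response_mask):
    """Sum log-probabilities over tokens where response_mask is 1."""
    return sum(lp for lp, m in zip(token_logprobs, response_mask) if m)

def build_reference_logprobs_cache(batches):
    """batches: iterable of (index, chosen_lps, chosen_mask, rejected_lps, rejected_mask)."""
    cache = {}
    for idx, c_lps, c_mask, r_lps, r_mask in batches:
        cache[idx] = (
            sequence_logprob(c_lps, c_mask),
            sequence_logprob(r_lps, r_mask),
        )
    return cache

# Two prompt tokens (masked out) followed by two response tokens.
batches = [(0, [-1.0, -2.0, -0.5, -0.5], [0, 0, 1, 1],
               [-1.0, -2.0, -3.0, -3.0], [0, 0, 1, 1])]
cache = build_reference_logprobs_cache(batches)
print(cache[0])  # (-1.0, -6.0)
```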

3. Training Loop:

  • For each batch, run the policy model forward on both chosen and rejected responses (either concatenated for efficiency or separately for memory savings).
  • Compute the DPO loss variant specified by the configuration (standard DPO, DPO-Norm, SimPO, or WPO).
  • Backpropagate gradients, clip gradient norms (if configured), and apply the optimizer update.
  • Track metrics including loss, implicit rewards, reward accuracy, and reward margin.
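The loss dispatch and the implicit-reward metrics in the loop above can be sketched in scalar form. The DPO and SimPO formulas follow the published papers; the dispatch structure, names, and defaults here are illustrative assumptions.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def preference_loss(loss_type, chosen_logps, rejected_logps,
                    ref_chosen_logps=None, ref_rejected_logps=None,
                    beta=0.1, gamma=0.0, chosen_len=1, rejected_len=1):
    if loss_type == "dpo":
        # Standard DPO: margin of policy-vs-reference log-ratios.
        logits = beta * ((chosen_logps - ref_chosen_logps)
                         - (rejected_logps - ref_rejected_logps))
    elif loss_type == "simpo":
        # SimPO: length-normalized policy logprobs, no reference model.
        logits = beta * (chosen_logps / chosen_len
                         - rejected_logps / rejected_len) - gamma
    else:
        raise ValueError(f"unknown loss_type: {loss_type}")
    return -log_sigmoid(logits)

def implicit_rewards(beta, chosen_logps, rejected_logps, ref_chosen, ref_rejected):
    """Per-example implicit rewards and their margin, as tracked in the metrics."""
    chosen_r = beta * (chosen_logps - ref_chosen)
    rejected_r = beta * (rejected_logps - ref_rejected)
    return chosen_r, rejected_r, chosen_r - rejected_r

loss = preference_loss("dpo", chosen_logps=-10.0, rejected_logps=-20.0,
                       ref_chosen_logps=-12.0, ref_rejected_logps=-15.0)
```

Reward accuracy is then the fraction of examples whose margin is positive, averaged over the batch.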

4. Checkpointing and Logging:

  • Save model checkpoints at configured intervals (step-based or epoch-based).
  • Log training metrics to Weights & Biases for experiment tracking.
  • Support resumption from checkpoints for fault tolerance.
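The step-based checkpointing decision reduces to a modulus check on the completed optimizer steps. A minimal sketch; the parameter name `checkpointing_steps` is an assumption, not necessarily the flag open-instruct uses.

```python
# Sketch of the stage 4 step-based checkpoint decision.

def should_checkpoint(completed_steps, checkpointing_steps):
    """Save every `checkpointing_steps` optimizer steps."""
    if checkpointing_steps is None:
        return False  # epoch-based saving is handled at the end of each epoch
    return completed_steps > 0 and completed_steps % checkpointing_steps == 0

print(should_checkpoint(500, 500), should_checkpoint(501, 500), should_checkpoint(10, None))
```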

5. Post-Training:

  • Save the final model and tokenizer.
  • Optionally push to HuggingFace Hub and launch evaluation jobs.

Usage

Use the DPO training loop when:

  • You want to align a language model using preference data.
  • You prefer a simpler training pipeline compared to RLHF (no separate reward model training).
  • You need distributed training across multiple GPUs or nodes.
  • You want to experiment with different DPO loss variants (standard, normalized, SimPO, WPO).

Theoretical Basis

The DPO training procedure optimizes:

θ* = argmin_θ 𝔼_{(x, y_w, y_l) ∼ 𝒟} [ ℒ(π_θ, π_ref, x, y_w, y_l) ]

where ℒ is the chosen loss variant. The training loop implements stochastic gradient descent on this objective:

# Pre-training: cache reference logprobs (if needed)
reference_cache = None
if loss_type in ("dpo", "dpo_norm", "wpo"):  # SimPO needs no reference model
    reference_cache = build_reference_logprobs_cache(model, dataloader)

# Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        # Forward pass: compute policy logprobs
        chosen_logps, rejected_logps = forward(model, batch)

        # Compute loss (dispatches to appropriate variant)
        loss = compute_loss(args, batch, chosen_logps, rejected_logps, reference_cache)

        # Backward pass and optimization
        loss.backward()
        if max_grad_norm > 0:
            clip_grad_norm(model.parameters(), max_grad_norm)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

        # Log metrics
        log(loss, chosen_rewards, rejected_rewards)

    # Checkpoint
    save_checkpoint(model, optimizer, lr_scheduler)

The forward pass can be done in two modes:

  • Concatenated forward: Chosen and rejected inputs are concatenated into a single batch and processed in one model forward pass. This is more efficient for FSDP-based distributed training.
  • Separate forward: Chosen and rejected inputs are processed in two separate forward passes. This uses less peak memory but is slower.
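The two forward modes produce identical results and differ only in batching. A toy sketch, assuming `model` is any callable mapping a list of sequences to per-sequence logprobs (the real implementation operates on padded tensors):

```python
def concatenated_forward(model, chosen_batch, rejected_batch):
    """One forward pass over chosen + rejected, then split the outputs."""
    n = len(chosen_batch)
    all_logps = model(chosen_batch + rejected_batch)
    return all_logps[:n], all_logps[n:]

def separate_forward(model, chosen_batch, rejected_batch):
    """Two smaller forward passes: lower peak memory, more overhead."""
    return model(chosen_batch), model(rejected_batch)

# Toy "model": logprob = -(sequence length).
toy_model = lambda seqs: [-float(len(s)) for s in seqs]
chosen, rejected = [[1, 2], [3]], [[4], [5, 6, 7]]
assert concatenated_forward(toy_model, chosen, rejected) == \
       separate_forward(toy_model, chosen, rejected)
```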

Related Pages

Implemented By
