Principle: AllenAI open-instruct DPO Training
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, Distributed Training |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
DPO training is the end-to-end workflow for optimizing a language model's policy to align with human preferences using Direct Preference Optimization, without requiring a separately trained reward model.
Description
The DPO training loop implements a preference-based fine-tuning pipeline that takes a pre-trained language model and a dataset of preference pairs (chosen vs. rejected responses), and produces an aligned model. Unlike RLHF, which trains a reward model and then uses reinforcement learning (e.g., PPO) to optimize the policy, DPO directly optimizes the policy using a closed-form loss derived from the preference data.
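Concretely, the closed-form loss for standard DPO depends on just four log-probabilities per preference pair. A minimal sketch (function and argument names are illustrative, and β defaults to a typical value; this is not the repository's implementation):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    loss = -log sigmoid(beta * [(log pi - log pi_ref)(chosen)
                                - (log pi - log pi_ref)(rejected)])
    Also returns the implicit rewards that the training loop tracks.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    logits = chosen_reward - rejected_reward
    # -log(sigmoid(x)) == softplus(-x); guard against overflow for large -x
    loss = math.log1p(math.exp(-logits)) if logits > -30 else -logits
    return loss, chosen_reward, rejected_reward
```

When policy and reference agree, the loss equals log 2; it falls as the policy assigns relatively more probability mass to the chosen response than the reference does.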
The training workflow consists of the following stages:
1. Initialization:
- Set up the distributed training environment using HuggingFace Accelerate with optional DeepSpeed ZeRO optimization (stages 0-3).
- Load the pre-trained model and tokenizer.
- Configure LoRA adapters if parameter-efficient fine-tuning is desired.
- Prepare the preference dataset via tokenization and filtering.
2. Reference Logprob Caching (for DPO, DPO-Norm, WPO):
- Before training begins, compute the reference model's log-probabilities for all training examples.
- Cache these values to disk and GPU memory for reuse throughout training.
- This step is skipped for SimPO, which does not require a reference model.
3. Training Loop:
- For each batch, run the policy model forward on both chosen and rejected responses (either concatenated for efficiency or separately for memory savings).
- Compute the DPO loss variant specified by the configuration (standard DPO, DPO-Norm, SimPO, or WPO).
- Backpropagate the loss, clip the gradient norm (if configured), and step the optimizer and learning-rate scheduler.
- Track metrics including loss, implicit rewards, reward accuracy, and reward margin.
4. Checkpointing and Logging:
- Save model checkpoints at configured intervals (step-based or epoch-based).
- Log training metrics to Weights & Biases for experiment tracking.
- Support resumption from checkpoints for fault tolerance.
5. Post-Training:
- Save the final model and tokenizer.
- Optionally push to HuggingFace Hub and launch evaluation jobs.
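Stage 2 above can be sketched independently of any framework: a single pass over the dataloader with the frozen reference model, storing per-example log-probabilities for lookup during training. Here ref_forward stands in for the reference model's forward pass, and the batch layout is an assumption, not the repository's format:

```python
def build_reference_logprobs_cache(ref_forward, dataloader):
    """One-off pass with the frozen reference model.

    ref_forward: callable mapping a batch to parallel lists of per-sequence
    log-probabilities (chosen, rejected). Returns a dict keyed by example
    index, so reference values are looked up, not recomputed, each epoch.
    """
    cache = {}
    for batch in dataloader:
        chosen_logps, rejected_logps = ref_forward(batch)
        for i, idx in enumerate(batch["index"]):
            cache[idx] = (chosen_logps[i], rejected_logps[i])
    return cache
```

Because the cache replaces live reference-model forwards, the reference model can be freed after this stage, releasing its GPU memory for the rest of training.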
Usage
Use the DPO training loop when:
- You want to align a language model using preference data.
- You prefer a simpler training pipeline compared to RLHF (no separate reward model training).
- You need distributed training across multiple GPUs or nodes.
- You want to experiment with different DPO loss variants (standard, normalized, SimPO, WPO).
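The variants listed above differ mainly in the margin fed into −log σ(·). A hedged sketch of those margins (formulas follow the DPO and SimPO papers; dpo_norm is assumed here to denote length-normalized log-ratios, and WPO, which reweights the standard DPO loss by policy probabilities, is omitted):

```python
def preference_margin(variant, pi_w, pi_l, ref_w=0.0, ref_l=0.0,
                      len_w=1, len_l=1, beta=0.1, gamma=0.5):
    """Margin fed into -log(sigmoid(margin)) for each loss variant.

    pi_*/ref_*: policy/reference log-probs of chosen (w) and rejected (l)
    responses; len_*: response lengths in tokens. Names and default
    hyperparameters are illustrative, not the repository's.
    """
    if variant == "dpo":
        return beta * ((pi_w - ref_w) - (pi_l - ref_l))
    if variant == "dpo_norm":
        # assumption: "norm" means length-normalized log-ratios
        return beta * ((pi_w - ref_w) / len_w - (pi_l - ref_l) / len_l)
    if variant == "simpo":
        # reference-free: average policy log-prob minus a target margin gamma
        return beta * (pi_w / len_w - pi_l / len_l) - gamma
    raise ValueError(f"unknown variant: {variant}")
```

Note that only simpo drops the reference terms entirely, which is why the reference-caching stage is skipped for it.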
Theoretical Basis
The DPO training procedure minimizes the expected pairwise preference loss over a dataset D of prompts x with chosen responses y_w and rejected responses y_l:

L(θ) = E_{(x, y_w, y_l) ~ D} [ ℓ(x, y_w, y_l; θ) ]

where ℓ is the chosen loss variant; for standard DPO,

ℓ_DPO = −log σ( β [ (log π_θ(y_w|x) − log π_ref(y_w|x)) − (log π_θ(y_l|x) − log π_ref(y_l|x)) ] ).

The training loop implements stochastic gradient descent on this objective:
# Pre-training: cache reference logprobs (if needed)
reference_cache = None
if loss_type in ("dpo", "dpo_norm", "wpo"):  # SimPO is reference-free
    reference_cache = build_reference_logprobs_cache(reference_model, dataloader)

# Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        # Forward pass: compute policy logprobs for chosen and rejected responses
        chosen_logps, rejected_logps = forward(model, batch)

        # Compute loss (dispatches to the configured variant)
        loss, chosen_rewards, rejected_rewards = compute_loss(
            args, batch, chosen_logps, rejected_logps, reference_cache)

        # Backward pass and optimization
        loss.backward()
        if max_grad_norm > 0:
            clip_grad_norm(model.parameters(), max_grad_norm)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

        # Log metrics
        log(loss, chosen_rewards, rejected_rewards)

    # Checkpoint
    save_checkpoint(model, optimizer, lr_scheduler)
The forward pass can be done in two modes:
- Concatenated forward: Chosen and rejected inputs are concatenated into a single batch and processed in one model forward pass. This is more efficient for FSDP-based distributed training.
- Separate forward: Chosen and rejected inputs are processed in two separate forward passes. This uses less peak memory but is slower.
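The two modes can be sketched with a toy forward function. Both return identical per-sequence log-probabilities; only the number of model calls and the peak activation memory differ. Here model_fn and the batch keys are placeholders, not the repository's API:

```python
import torch
import torch.nn.functional as F

def sequence_logprobs(logits, input_ids, mask):
    """Sum of log p(token_t | tokens_<t) over unmasked positions."""
    logp = F.log_softmax(logits[:, :-1, :], dim=-1)
    tok_logp = torch.gather(logp, 2, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return (tok_logp * mask[:, 1:]).sum(dim=-1)

def concatenated_forward(model_fn, batch):
    """One forward over [chosen; rejected], then split: fewer, larger calls."""
    n = batch["chosen_ids"].shape[0]
    ids = torch.cat([batch["chosen_ids"], batch["rejected_ids"]], dim=0)
    mask = torch.cat([batch["chosen_mask"], batch["rejected_mask"]], dim=0)
    logps = sequence_logprobs(model_fn(ids, mask), ids, mask)
    return logps[:n], logps[n:]

def separate_forward(model_fn, batch):
    """Two smaller forwards: lower peak memory, same numerical result."""
    ids_c, m_c = batch["chosen_ids"], batch["chosen_mask"]
    ids_r, m_r = batch["rejected_ids"], batch["rejected_mask"]
    return (sequence_logprobs(model_fn(ids_c, m_c), ids_c, m_c),
            sequence_logprobs(model_fn(ids_r, m_r), ids_r, m_r))
```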