Principle:Eric_mitchell_Direct_preference_optimization_Training_Loop
| Knowledge Sources | |
|---|---|
| Domains | Training, Optimization, Deep_Learning |
| Last Updated | 2026-02-08 02:00 GMT |
Overview
A training procedure that iterates over batched data, computes loss with gradient accumulation, performs optimizer steps with warmup scheduling and gradient clipping, and interleaves periodic evaluation and checkpointing.
Description
The training loop is the central execution engine for both SFT and DPO training. It orchestrates the complete training process including:
- Gradient accumulation: Multiple microbatches are accumulated before each optimizer step, allowing effective batch sizes larger than GPU memory permits
- Optimizer selection: Configurable optimizer (default RMSprop), chosen for memory efficiency over Adam: RMSprop keeps one running statistic per parameter, versus Adam's two
- Learning rate warmup: Linear warmup schedule that ramps from 0 to the target learning rate over a configurable number of steps
- Gradient clipping: Max-norm gradient clipping to prevent training instability
- Evaluation interleaving: Periodic evaluation on held-out data with metric logging to wandb
- Sample generation: Optional text generation during evaluation to qualitatively assess model behavior
- Checkpointing: Saving model, optimizer, and scheduler state at evaluation points
The same training loop handles both SFT (negative log-likelihood on preferred responses) and DPO (preference loss on chosen/rejected pairs), switching behavior based on the loss configuration.
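The two loss modes can be sketched as follows. The DPO formula matches the published objective, -log σ(β · (chosen log-ratio − rejected log-ratio)); the function and batch-field names here are illustrative, not the repository's actual API:

```python
import math

def sft_loss(chosen_logps):
    """SFT: mean negative log-likelihood of the preferred responses."""
    return -sum(chosen_logps) / len(chosen_logps)

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: -log sigmoid(beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))),
    averaged over the batch; inputs are per-sequence log-probabilities."""
    losses = []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        logits = beta * ((pc - rc) - (pr - rr))
        losses.append(math.log(1.0 + math.exp(-logits)))  # numerically = -log sigmoid(logits)
    return sum(losses) / len(losses)

def compute_loss(config, batch):
    # The surrounding loop is identical; only the loss switches on the config.
    if config["loss"] == "sft":
        return sft_loss(batch["chosen_logps"])
    return dpo_loss(batch["policy_chosen"], batch["policy_rejected"],
                    batch["ref_chosen"], batch["ref_rejected"],
                    beta=config.get("beta", 0.1))
```

When the policy matches the reference model exactly, both log-ratios cancel and the DPO loss sits at log 2, which is the expected starting point at the beginning of DPO training.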
Usage
Use this principle when executing the training process after model loading and data pipeline setup. The training loop is the final execution step in both the SFT and DPO workflows.
Theoretical Basis
The training procedure follows standard stochastic gradient descent with enhancements:
$$\theta_{t+1} = \theta_t - \eta_t \cdot \operatorname{clip}\!\left(\nabla_\theta \mathcal{L},\ \text{max\_norm}\right)$$
where $\eta_t$ follows a linear warmup schedule:
$$\eta_t = \eta \cdot \min\left(1,\ \frac{t+1}{\text{warmup\_steps}+1}\right)$$
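The warmup schedule is a one-liner; this sketch is a direct transcription of the formula above, not the repository's scheduler code:

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linear warmup: ramp from near 0 to base_lr over warmup_steps, then hold."""
    return base_lr * min(1.0, (step + 1) / (warmup_steps + 1))
```

With `base_lr=1e-4` and `warmup_steps=99`, step 0 uses 1e-6 and step 99 onward uses the full 1e-4.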
Pseudo-code:
# Abstract training loop (NOT actual implementation)
for batch in data_iterator:
    if should_evaluate:
        run_evaluation(eval_batches)
        save_checkpoint()
    for microbatch in split(batch, accumulation_steps):
        loss = compute_loss(model, microbatch) / accumulation_steps
        loss.backward()
    clip_gradients(model, max_norm)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
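To see why each microbatch loss is divided by `accumulation_steps`, here is a numeric check on a toy squared-error model (pure Python, names hypothetical): summing the scaled per-microbatch gradients reproduces the gradient of the mean loss over the full batch, assuming equal-size microbatches.

```python
def grad_mse(w, xs, ys):
    """Gradient of L(w) = mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, accumulation_steps):
    """Accumulate gradients over equal-size microbatches, each scaled by
    1/accumulation_steps -- mirroring `loss / accumulation_steps` in the loop."""
    micro = len(xs) // accumulation_steps
    total = 0.0
    for i in range(accumulation_steps):
        mx = xs[i * micro:(i + 1) * micro]
        my = ys[i * micro:(i + 1) * micro]
        total += grad_mse(w, mx, my) / accumulation_steps
    return total
```

Because the scaled microbatch gradients sum to the full-batch gradient, the optimizer step after accumulation behaves as if the whole effective batch had fit in memory at once.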
Related Pages
Implemented By
Uses Heuristic
- Heuristic:Eric_mitchell_Direct_preference_optimization_FSDP_Mixed_Precision_BFloat16
- Heuristic:Eric_mitchell_Direct_preference_optimization_Activation_Checkpointing_Memory
- Heuristic:Eric_mitchell_Direct_preference_optimization_RMSprop_Over_Adam
- Heuristic:Eric_mitchell_Direct_preference_optimization_FSDP_Batch_Size_Per_GPU