
Principle:Eric Mitchell Direct Preference Optimization Training Loop

From Leeroopedia


Knowledge Sources
Domains Training, Optimization, Deep_Learning
Last Updated 2026-02-08 02:00 GMT

Overview

A training procedure that iterates over batched data, computes loss with gradient accumulation, performs optimizer steps with warmup scheduling and gradient clipping, and interleaves periodic evaluation and checkpointing.

Description

The training loop is the central execution engine for both SFT and DPO training. It orchestrates the complete training process including:

  • Gradient accumulation: Multiple microbatches are accumulated before each optimizer step, allowing effective batch sizes larger than GPU memory permits
  • Optimizer selection: Configurable optimizer (default RMSprop), chosen for memory efficiency over Adam
  • Learning rate warmup: Linear warmup schedule that ramps from 0 to the target learning rate over a configurable number of steps
  • Gradient clipping: Max-norm gradient clipping to prevent training instability
  • Evaluation interleaving: Periodic evaluation on held-out data with metric logging to wandb
  • Sample generation: Optional text generation during evaluation to qualitatively assess model behavior
  • Checkpointing: Saving model, optimizer, and scheduler state at evaluation points

The same training loop handles both SFT (negative log-likelihood on preferred responses) and DPO (preference loss on chosen/rejected pairs), switching behavior based on the loss configuration.
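The DPO branch of that switch can be sketched as below. This is a minimal illustration of the standard DPO preference loss, not the repository's actual API; the function name and the per-sequence log-probability arguments are assumptions for the example:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO preference loss for one chosen/rejected pair.

    Inputs are summed log-probabilities of each response under the
    policy and the frozen reference model. The loss is
    -log sigmoid(beta * (policy log-ratio - reference log-ratio)).
    """
    logits = beta * ((policy_chosen_logp - policy_rejected_logp)
                     - (ref_chosen_logp - ref_rejected_logp))
    # -log sigmoid(x) = log(1 + exp(-x))
    return math.log1p(math.exp(-logits))
```

With identical policy and reference log-probabilities the logits are zero and the loss is log 2; as the policy's margin for the chosen response grows, the loss shrinks toward zero.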

Usage

Use this principle when executing the training process after model loading and data pipeline setup. The training loop is the final execution step in both the SFT and DPO workflows.

Theoretical Basis

The training procedure follows standard stochastic gradient descent with enhancements:

\theta_{t+1} = \theta_t - \eta_t \cdot \text{clip}\left(\nabla_\theta \mathcal{L}, \text{max\_norm}\right)

where ηt follows a linear warmup schedule:

\eta_t = \eta \cdot \min\left(1, \frac{t+1}{\text{warmup\_steps}+1}\right)
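The warmup schedule above is a one-liner in code. This helper is an illustrative sketch of that formula, not taken from the implementation:

```python
def warmup_lr(base_lr, step, warmup_steps):
    """Linear warmup: ramp from base_lr/(warmup_steps+1) at step 0
    up to base_lr, then hold constant."""
    return base_lr * min(1.0, (step + 1) / (warmup_steps + 1))
```

For example, with `base_lr=1e-3` and `warmup_steps=99`, step 0 uses 1e-5 and step 99 onward uses the full 1e-3.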

Pseudo-code:

# Abstract training loop (NOT actual implementation)
for batch in data_iterator:
    if should_evaluate:
        run_evaluation(eval_batches)
        save_checkpoint()
    for microbatch in split(batch, accumulation_steps):
        loss = compute_loss(model, microbatch) / accumulation_steps
        loss.backward()
    clip_gradients(model, max_norm)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
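The pseudo-code can be made concrete on a toy problem. The sketch below minimizes a scalar quadratic with hand-computed gradients so that the accumulation, clipping, warmup, and step/zero-grad ordering are all visible without any framework; every name and constant here is illustrative:

```python
def train_scalar(target=3.0, base_lr=0.5, warmup_steps=5,
                 accumulation_steps=4, max_norm=1.0, num_steps=50):
    """Minimize (w - target)^2 with the loop structure above:
    accumulate microbatch gradients, clip to max_norm, apply a
    linearly warmed-up learning rate, then reset the gradient."""
    w, grad = 0.0, 0.0
    for t in range(num_steps):
        for _ in range(accumulation_steps):
            g = 2.0 * (w - target)            # d/dw of (w - target)^2
            grad += g / accumulation_steps    # loss scaled by 1/accumulation_steps
        norm = abs(grad)
        if norm > max_norm:                   # clip_gradients(model, max_norm)
            grad *= max_norm / norm
        lr = base_lr * min(1.0, (t + 1) / (warmup_steps + 1))  # scheduler
        w -= lr * grad                        # optimizer.step()
        grad = 0.0                            # optimizer.zero_grad()
    return w
```

Clipping caps the early updates while the warmup ramp is still small, and once the gradient norm falls below `max_norm` the unclipped steps converge to the target.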

Related Pages

Implemented By

Uses Heuristic
