Principle:Eric_mitchell_Direct_preference_optimization_Training_Loop
| Knowledge Sources | |
|---|---|
| Domains | Training, Optimization, Deep_Learning |
| Last Updated | 2026-02-08 02:00 GMT |
Overview
A training procedure that iterates over batched data, computes loss with gradient accumulation, performs optimizer steps with warmup scheduling and gradient clipping, and interleaves periodic evaluation and checkpointing.
Description
The training loop is the central execution engine for both SFT and DPO training. It orchestrates the complete training process including:
- Gradient accumulation: Multiple microbatches are accumulated before each optimizer step, allowing effective batch sizes larger than GPU memory permits
- Optimizer selection: Configurable optimizer (default RMSprop), chosen for memory efficiency over Adam: RMSprop keeps one running statistic per parameter, versus Adam's two
- Learning rate warmup: Linear warmup schedule that ramps from 0 to the target learning rate over a configurable number of steps
- Gradient clipping: Max-norm gradient clipping to prevent training instability
- Evaluation interleaving: Periodic evaluation on held-out data with metric logging to wandb
- Sample generation: Optional text generation during evaluation to qualitatively assess model behavior
- Checkpointing: Saving model, optimizer, and scheduler state at evaluation points
The same training loop handles both SFT (negative log-likelihood on preferred responses) and DPO (preference loss on chosen/rejected pairs), switching behavior based on the loss configuration.
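The two loss modes can be sketched as follows. The DPO formula matches the published objective, -log σ(β · (chosen log-ratio − rejected log-ratio)); the function and batch-field names here are illustrative, not the repository's actual API:

```python
import math

def sft_loss(chosen_logps):
    """SFT: mean negative log-likelihood of the preferred responses."""
    return -sum(chosen_logps) / len(chosen_logps)

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: -log sigmoid(beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))),
    averaged over the batch; inputs are per-sequence log-probabilities."""
    losses = []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        logits = beta * ((pc - rc) - (pr - rr))
        losses.append(math.log(1.0 + math.exp(-logits)))  # numerically = -log sigmoid(logits)
    return sum(losses) / len(losses)

def compute_loss(config, batch):
    # The surrounding loop is identical; only the loss switches on the config.
    if config["loss"] == "sft":
        return sft_loss(batch["chosen_logps"])
    return dpo_loss(batch["policy_chosen"], batch["policy_rejected"],
                    batch["ref_chosen"], batch["ref_rejected"],
                    beta=config.get("beta", 0.1))
```

When the policy matches the reference model exactly, both log-ratios cancel and the DPO loss sits at log 2, which is the expected starting point at the beginning of DPO training.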
Usage
Use this principle when executing the training process after model loading and data pipeline setup. The training loop is the final execution step in both the SFT and DPO workflows.
Theoretical Basis
The training procedure follows standard stochastic gradient descent with enhancements:
$$\theta_{t+1} = \theta_t - \eta_t \cdot \operatorname{clip}\!\left(\nabla_\theta \mathcal{L},\ \text{max\_norm}\right)$$
where $\eta_t$ follows a linear warmup schedule:
$$\eta_t = \eta \cdot \min\left(1,\ \frac{t+1}{\text{warmup\_steps}+1}\right)$$
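The warmup schedule is a one-liner; this sketch is a direct transcription of the formula above, not the repository's scheduler code:

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linear warmup: ramp from near 0 to base_lr over warmup_steps, then hold."""
    return base_lr * min(1.0, (step + 1) / (warmup_steps + 1))
```

With `base_lr=1e-4` and `warmup_steps=99`, step 0 uses 1e-6 and step 99 onward uses the full 1e-4.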
Pseudo-code:
# Abstract training loop (NOT actual implementation)
for batch in data_iterator:
    if should_evaluate:
        run_evaluation(eval_batches)
        save_checkpoint()
    for microbatch in split(batch, accumulation_steps):
        loss = compute_loss(model, microbatch) / accumulation_steps
        loss.backward()
    clip_gradients(model, max_norm)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
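To see why each microbatch loss is divided by `accumulation_steps`, here is a numeric check on a toy squared-error model (pure Python, names hypothetical): summing the scaled per-microbatch gradients reproduces the gradient of the mean loss over the full batch, assuming equal-size microbatches.

```python
def grad_mse(w, xs, ys):
    """Gradient of L(w) = mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, accumulation_steps):
    """Accumulate gradients over equal-size microbatches, each scaled by
    1/accumulation_steps -- mirroring `loss / accumulation_steps` in the loop."""
    micro = len(xs) // accumulation_steps
    total = 0.0
    for i in range(accumulation_steps):
        mx = xs[i * micro:(i + 1) * micro]
        my = ys[i * micro:(i + 1) * micro]
        total += grad_mse(w, mx, my) / accumulation_steps
    return total
```

Because the scaled microbatch gradients sum to the full-batch gradient, the optimizer step after accumulation behaves as if the whole effective batch had fit in memory at once.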
Related Pages
Implemented By
Uses Heuristic
- Heuristic:Eric_mitchell_Direct_preference_optimization_FSDP_Mixed_Precision_BFloat16
- Heuristic:Eric_mitchell_Direct_preference_optimization_Activation_Checkpointing_Memory
- Heuristic:Eric_mitchell_Direct_preference_optimization_RMSprop_Over_Adam
- Heuristic:Eric_mitchell_Direct_preference_optimization_FSDP_Batch_Size_Per_GPU