Workflow: Alibaba ROLL DPO Training Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Preference_Alignment, Distributed_Training |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
End-to-end process for aligning Large Language Models to human preferences using Direct Preference Optimization (DPO) with chosen and rejected response pairs.
Description
This workflow implements the DPO training pipeline in the ROLL framework. DPO optimizes a language model's policy directly from preference data without training a separate reward model. Given pairs of chosen (preferred) and rejected responses for each prompt, the pipeline computes a contrastive loss that increases the likelihood of preferred responses relative to rejected ones, regularized by KL divergence from a frozen reference model. The pipeline supports both standard DPO and the IPO (Identity Preference Optimization) variant with optional label smoothing.
Usage
Execute this workflow when you have a preference dataset containing prompts with paired chosen and rejected responses (e.g., from human annotation or AI feedback), and you want to align a base or instruction-tuned LLM to prefer higher-quality responses without the complexity of online RL training.
Execution Steps
Step 1: Environment Setup and Configuration
Prepare the compute environment and define the Hydra YAML configuration specifying the model path, preference dataset location, DPO-specific parameters (beta, label_smoothing, IPO mode), and distributed training backend. Configure device mappings for actor training and reference model workers.
Key considerations:
- The beta parameter controls how strongly the model is constrained to stay close to the reference policy (typical values: 0.1-0.5)
- IPO mode uses a different loss formulation that avoids the log-sigmoid and may be more stable in some settings
- Label smoothing can regularize the preference signal to handle noisy annotations
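To make the DPO-specific knobs from this step concrete, here is a minimal configuration sketch. The key names and the placeholder paths are illustrative assumptions, not the exact ROLL/Hydra schema:

```python
# Hypothetical DPO configuration sketch; key names and paths are illustrative
# and do not reproduce the exact ROLL/Hydra schema.
dpo_config = {
    "model_path": "path/to/base_model",        # actor and reference start here
    "dataset_path": "path/to/preferences.json",
    "beta": 0.1,             # KL-constraint strength (typical range 0.1-0.5)
    "label_smoothing": 0.0,  # > 0 regularizes against noisy preference labels
    "use_ipo": False,        # switch to the IPO loss variant
}

def validate(cfg):
    """Basic sanity checks on the DPO-specific fields."""
    assert 0.0 < cfg["beta"] <= 1.0, "beta outside the usual range"
    assert 0.0 <= cfg["label_smoothing"] < 0.5, "label_smoothing must be < 0.5"
    return cfg
```

A larger beta keeps the learned policy closer to the reference model; a smaller beta lets the preference signal dominate.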
Step 2: Preference Dataset Preparation
Prepare the preference dataset in JSON format with prompt, chosen response, and rejected response fields. The dataset is tokenized using the model's chat template, encoding both chosen and rejected completions with appropriate attention masks and label boundaries.
What happens:
- Each example is processed into two tokenized sequences: prompt + chosen and prompt + rejected
- Labels are masked so that loss is only computed on response tokens (not prompt tokens)
- The chosen_key and rejected_key configuration parameters map to the dataset fields
- Data is split into training and validation sets
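The label-masking step above can be sketched in a few lines. This is a toy illustration of the common preprocessing pattern (prompt tokens masked with an ignore index), not ROLL's actual tokenization code; the helper name and token ids are made up:

```python
def build_labels(prompt_ids, response_ids, ignore_index=-100):
    """Concatenate prompt and response token ids, masking prompt positions so
    the loss is computed only on response tokens (illustrative helper; exact
    function names in ROLL may differ)."""
    input_ids = prompt_ids + response_ids
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    attention_mask = [1] * len(input_ids)
    return input_ids, labels, attention_mask

# One preference example yields two sequences: prompt+chosen and prompt+rejected.
prompt = [101, 7592]          # toy token ids
chosen = [2023, 2003, 102]
rejected = [4997, 102]

chosen_ids, chosen_labels, _ = build_labels(prompt, chosen)
rejected_ids, rejected_labels, _ = build_labels(prompt, rejected)
```

With the `ignore_index` convention, loss functions skip the masked positions, so the prompt contributes context but no gradient.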
Step 3: Distributed Worker Initialization
Launch the Ray cluster and initialize two worker groups: the actor training cluster (policy being optimized) and the reference model cluster (frozen initial policy for KL regularization). Both clusters load the same initial model weights, but only the actor's weights are updated during training.
Key considerations:
- Reference and actor workers can share GPUs using offload/reload cycles
- The reference model remains frozen throughout training and provides the baseline log probabilities
Step 4: Reference Log Probability Computation
For each training batch, compute log probabilities under the frozen reference model for both chosen and rejected responses. These reference log probabilities serve as the baseline for the DPO loss, preventing the policy from deviating too far from the initial model.
What happens:
- Chosen and rejected sequences are passed through the reference model
- Per-token log probabilities are collected and masked to response tokens only
- Results are cached for use in the DPO loss computation
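The masking-and-summing described above reduces to one small function. A minimal sketch, assuming per-token log probabilities have already been gathered from the reference model's output:

```python
def masked_sequence_logprob(token_logprobs, labels, ignore_index=-100):
    """Sum per-token log probabilities over response positions only.
    token_logprobs[i] is log p(token_i | prefix) under the (frozen) reference
    model; positions where labels == ignore_index (prompt tokens) are skipped."""
    return sum(lp for lp, lab in zip(token_logprobs, labels) if lab != ignore_index)

# Toy example: 2 masked prompt tokens + 3 response tokens.
logps = [-0.5, -0.7, -1.0, -2.0, -0.1]
labels = [-100, -100, 11, 12, 13]
ref_logp = masked_sequence_logprob(logps, labels)  # -1.0 + -2.0 + -0.1 = -3.1
```

The same reduction is applied to the chosen and the rejected sequence, giving the four sequence-level log probabilities the DPO loss consumes.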
Step 5: DPO Loss Computation and Policy Update
Compute the actor model's log probabilities for both chosen and rejected responses. Calculate the DPO loss as a function of the log-probability margins between chosen and rejected responses under the current policy versus the reference policy. Apply gradient updates to optimize the actor model.
Key considerations:
- Standard DPO loss: negative log-sigmoid of beta times the difference in log-probability ratios
- The IPO variant replaces the log-sigmoid with a squared loss that pushes the log-ratio margin toward 1/(2*beta), for potentially more stable optimization
- Gradient accumulation handles large effective batch sizes across micro-batches and DP ranks
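The loss described above can be written directly from the four sequence log probabilities. This is a per-example sketch of the standard formulations (DPO with optional label smoothing, and the IPO squared loss), not ROLL's internal implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             beta=0.1, label_smoothing=0.0, ipo=False):
    """Per-example preference loss from sequence log probabilities.
    Standard DPO: negative log-sigmoid of beta * (policy margin - reference
    margin), with optional label smoothing for noisy annotations.
    IPO: squared distance of the margin from 1/(2*beta)."""
    logits = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    if ipo:
        return (logits - 1.0 / (2.0 * beta)) ** 2
    return (-(1.0 - label_smoothing) * math.log(sigmoid(beta * logits))
            - label_smoothing * math.log(sigmoid(-beta * logits)))

# At initialization the policy equals the reference, so logits = 0 and the
# standard DPO loss starts at -log(0.5) = log 2.
loss0 = dpo_loss(-3.1, -4.0, -3.1, -4.0)
```

Minimizing this loss widens the policy's chosen-vs-rejected margin relative to the reference's margin, which is exactly the contrastive update described in the step above.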
Step 6: Validation and Checkpointing
Periodically evaluate on a held-out validation set by computing DPO loss, preference accuracy (how often the model assigns higher probability to chosen over rejected), and implicit reward margins. Save model checkpoints at configured intervals and log metrics to the tracking backend.
Key considerations:
- Preference accuracy is the primary evaluation metric (percentage of examples where chosen response has higher implicit reward)
- Validation loss trends indicate convergence and potential overfitting
- Checkpoints can be converted to HuggingFace format for deployment
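The two validation metrics above follow directly from the implicit reward definition (beta times the policy-vs-reference log-probability ratio). A minimal sketch with hypothetical field names:

```python
def preference_metrics(batch, beta=0.1):
    """Preference accuracy and mean implicit reward margin for a batch.
    Each item holds sequence log probabilities under the policy and the frozen
    reference model (field names here are illustrative)."""
    correct, margins = 0, []
    for ex in batch:
        reward_chosen = beta * (ex["policy_chosen"] - ex["ref_chosen"])
        reward_rejected = beta * (ex["policy_rejected"] - ex["ref_rejected"])
        margins.append(reward_chosen - reward_rejected)
        correct += reward_chosen > reward_rejected
    return {"accuracy": correct / len(batch),
            "reward_margin": sum(margins) / len(margins)}

batch = [
    {"policy_chosen": -3.0, "ref_chosen": -3.5,
     "policy_rejected": -5.0, "ref_rejected": -4.0},  # chosen preferred
    {"policy_chosen": -4.0, "ref_chosen": -3.5,
     "policy_rejected": -3.0, "ref_rejected": -4.0},  # rejected preferred
]
metrics = preference_metrics(batch)
```

An untrained policy scores near 50% accuracy; rising accuracy with a flat or worsening validation loss is the overfitting signal mentioned above.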