Workflow: Alibaba ROLL DPO Training Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Preference_Alignment, Distributed_Training |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
End-to-end process for aligning Large Language Models to human preferences using Direct Preference Optimization (DPO) with chosen and rejected response pairs.
Description
This workflow implements the DPO training pipeline in the ROLL framework. DPO optimizes a language model's policy directly from preference data without training a separate reward model. Given pairs of chosen (preferred) and rejected responses for each prompt, the pipeline computes a contrastive loss that increases the likelihood of preferred responses relative to rejected ones, regularized by KL divergence from a frozen reference model. The pipeline supports both standard DPO and the IPO (Identity Preference Optimization) variant with optional label smoothing.
Usage
Execute this workflow when you have a preference dataset containing prompts with paired chosen and rejected responses (e.g., from human annotation or AI feedback), and you want to align a base or instruction-tuned LLM to prefer higher-quality responses without the complexity of online RL training.
Execution Steps
Step 1: Environment Setup and Configuration
Prepare the compute environment and define the Hydra YAML configuration specifying the model path, preference dataset location, DPO-specific parameters (beta, label_smoothing, IPO mode), and distributed training backend. Configure device mappings for actor training and reference model workers.
Key considerations:
- The beta parameter controls how strongly the model is constrained to stay close to the reference policy (typical values: 0.1-0.5)
- IPO mode uses a different loss formulation that avoids the log-sigmoid and may be more stable in some settings
- Label smoothing can regularize the preference signal to handle noisy annotations
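To make the DPO-specific knobs from this step concrete, here is a minimal configuration sketch. The key names and the placeholder paths are illustrative assumptions, not the exact ROLL/Hydra schema:

```python
# Hypothetical DPO configuration sketch; key names and paths are illustrative
# and do not reproduce the exact ROLL/Hydra schema.
dpo_config = {
    "model_path": "path/to/base_model",        # actor and reference start here
    "dataset_path": "path/to/preferences.json",
    "beta": 0.1,             # KL-constraint strength (typical range 0.1-0.5)
    "label_smoothing": 0.0,  # > 0 regularizes against noisy preference labels
    "use_ipo": False,        # switch to the IPO loss variant
}

def validate(cfg):
    """Basic sanity checks on the DPO-specific fields."""
    assert 0.0 < cfg["beta"] <= 1.0, "beta outside the usual range"
    assert 0.0 <= cfg["label_smoothing"] < 0.5, "label_smoothing must be < 0.5"
    return cfg
```

A larger beta keeps the learned policy closer to the reference model; a smaller beta lets the preference signal dominate.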
Step 2: Preference Dataset Preparation
Prepare the preference dataset in JSON format with prompt, chosen response, and rejected response fields. The dataset is tokenized using the model's chat template, encoding both chosen and rejected completions with appropriate attention masks and label boundaries.
What happens:
- Each example is processed into two tokenized sequences: prompt + chosen and prompt + rejected
- Labels are masked so that loss is only computed on response tokens (not prompt tokens)
- The chosen_key and rejected_key configuration parameters map to the dataset fields
- Data is split into training and validation sets
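The label-masking step above can be sketched in a few lines. This is a toy illustration of the common preprocessing pattern (prompt tokens masked with an ignore index), not ROLL's actual tokenization code; the helper name and token ids are made up:

```python
def build_labels(prompt_ids, response_ids, ignore_index=-100):
    """Concatenate prompt and response token ids, masking prompt positions so
    the loss is computed only on response tokens (illustrative helper; exact
    function names in ROLL may differ)."""
    input_ids = prompt_ids + response_ids
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    attention_mask = [1] * len(input_ids)
    return input_ids, labels, attention_mask

# One preference example yields two sequences: prompt+chosen and prompt+rejected.
prompt = [101, 7592]          # toy token ids
chosen = [2023, 2003, 102]
rejected = [4997, 102]

chosen_ids, chosen_labels, _ = build_labels(prompt, chosen)
rejected_ids, rejected_labels, _ = build_labels(prompt, rejected)
```

With the `ignore_index` convention, loss functions skip the masked positions, so the prompt contributes context but no gradient.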
Step 3: Distributed Worker Initialization
Launch the Ray cluster and initialize two worker groups: the actor training cluster (policy being optimized) and the reference model cluster (frozen initial policy for KL regularization). Both clusters load the same initial model weights, but only the actor's weights are updated during training.
Key considerations:
- Reference and actor workers can share GPUs using offload/reload cycles
- The reference model remains frozen throughout training and provides the baseline log probabilities
Step 4: Reference Log Probability Computation
For each training batch, compute log probabilities under the frozen reference model for both chosen and rejected responses. These reference log probabilities serve as the baseline for the DPO loss, preventing the policy from deviating too far from the initial model.
What happens:
- Chosen and rejected sequences are passed through the reference model
- Per-token log probabilities are collected and masked to response tokens only
- Results are cached for use in the DPO loss computation
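The masking-and-summing described above reduces to one small function. A minimal sketch, assuming per-token log probabilities have already been gathered from the reference model's output:

```python
def masked_sequence_logprob(token_logprobs, labels, ignore_index=-100):
    """Sum per-token log probabilities over response positions only.
    token_logprobs[i] is log p(token_i | prefix) under the (frozen) reference
    model; positions where labels == ignore_index (prompt tokens) are skipped."""
    return sum(lp for lp, lab in zip(token_logprobs, labels) if lab != ignore_index)

# Toy example: 2 masked prompt tokens + 3 response tokens.
logps = [-0.5, -0.7, -1.0, -2.0, -0.1]
labels = [-100, -100, 11, 12, 13]
ref_logp = masked_sequence_logprob(logps, labels)  # -1.0 + -2.0 + -0.1 = -3.1
```

The same reduction is applied to the chosen and the rejected sequence, giving the four sequence-level log probabilities the DPO loss consumes.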
Step 5: DPO Loss Computation and Policy Update
Compute the actor model's log probabilities for both chosen and rejected responses. Calculate the DPO loss as a function of the log-probability margins between chosen and rejected responses under the current policy versus the reference policy. Apply gradient updates to optimize the actor model.
Key considerations:
- Standard DPO loss: negative log-sigmoid of beta times the difference in log-probability ratios
- The IPO variant replaces the log-sigmoid with a squared loss that pushes the log-ratio margin toward 1/(2*beta), for potentially more stable optimization
- Gradient accumulation handles large effective batch sizes across micro-batches and DP ranks
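The loss described above can be written directly from the four sequence log probabilities. This is a per-example sketch of the standard formulations (DPO with optional label smoothing, and the IPO squared loss), not ROLL's internal implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             beta=0.1, label_smoothing=0.0, ipo=False):
    """Per-example preference loss from sequence log probabilities.
    Standard DPO: negative log-sigmoid of beta * (policy margin - reference
    margin), with optional label smoothing for noisy annotations.
    IPO: squared distance of the margin from 1/(2*beta)."""
    logits = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    if ipo:
        return (logits - 1.0 / (2.0 * beta)) ** 2
    return (-(1.0 - label_smoothing) * math.log(sigmoid(beta * logits))
            - label_smoothing * math.log(sigmoid(-beta * logits)))

# At initialization the policy equals the reference, so logits = 0 and the
# standard DPO loss starts at -log(0.5) = log 2.
loss0 = dpo_loss(-3.1, -4.0, -3.1, -4.0)
```

Minimizing this loss widens the policy's chosen-vs-rejected margin relative to the reference's margin, which is exactly the contrastive update described in the step above.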
Step 6: Validation and Checkpointing
Periodically evaluate on a held-out validation set by computing DPO loss, preference accuracy (how often the model assigns higher probability to chosen over rejected), and implicit reward margins. Save model checkpoints at configured intervals and log metrics to the tracking backend.
Key considerations:
- Preference accuracy is the primary evaluation metric (percentage of examples where chosen response has higher implicit reward)
- Validation loss trends indicate convergence and potential overfitting
- Checkpoints can be converted to HuggingFace format for deployment
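The two validation metrics above follow directly from the implicit reward definition (beta times the policy-vs-reference log-probability ratio). A minimal sketch with hypothetical field names:

```python
def preference_metrics(batch, beta=0.1):
    """Preference accuracy and mean implicit reward margin for a batch.
    Each item holds sequence log probabilities under the policy and the frozen
    reference model (field names here are illustrative)."""
    correct, margins = 0, []
    for ex in batch:
        reward_chosen = beta * (ex["policy_chosen"] - ex["ref_chosen"])
        reward_rejected = beta * (ex["policy_rejected"] - ex["ref_rejected"])
        margins.append(reward_chosen - reward_rejected)
        correct += reward_chosen > reward_rejected
    return {"accuracy": correct / len(batch),
            "reward_margin": sum(margins) / len(margins)}

batch = [
    {"policy_chosen": -3.0, "ref_chosen": -3.5,
     "policy_rejected": -5.0, "ref_rejected": -4.0},  # chosen preferred
    {"policy_chosen": -4.0, "ref_chosen": -3.5,
     "policy_rejected": -3.0, "ref_rejected": -4.0},  # rejected preferred
]
metrics = preference_metrics(batch)
```

An untrained policy scores near 50% accuracy; rising accuracy with a flat or worsening validation loss is the overfitting signal mentioned above.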