Workflow:Hpcaitech ColossalAI DPO Alignment
| Knowledge Sources | |
|---|---|
| Domains | LLMs, RLHF, Alignment, Distributed_Training |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
End-to-end process for aligning a language model with human preference data using Direct Preference Optimization (DPO) on ColossalAI's distributed training framework.
Description
This workflow implements Direct Preference Optimization, an alignment method that optimizes a language model's policy directly on preference data, without training a separate reward model. The process trains an actor model against a frozen reference model, using paired chosen/rejected responses to compute the DPO loss. The implementation also supports SimPO (Simple Preference Optimization) via the gamma parameter and length normalization. Training supports multiple parallelism strategies through ColossalAI's plugin system, including ZeRO-2, Gemini, and 3D Parallelism.
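To make the SimPO mention concrete, here is a minimal sketch of the SimPO objective as published: the implicit reward is the length-normalized policy log probability, gamma acts as a target margin, and no reference model is used. The function names, default values, and scalar inputs are illustrative assumptions; this is not ColossalAI's implementation, which works on batched tensors.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def simpo_loss(
    chosen_logprob: float,    # summed log prob of the chosen response under the policy
    rejected_logprob: float,  # summed log prob of the rejected response under the policy
    chosen_len: int,          # number of response tokens counted in chosen_logprob
    rejected_len: int,        # number of response tokens counted in rejected_logprob
    beta: float = 2.0,        # assumed value, for illustration only
    gamma: float = 0.5,       # assumed target margin, for illustration only
) -> float:
    # Length normalization: average log prob per token, scaled by beta.
    chosen_reward = beta * chosen_logprob / chosen_len
    rejected_reward = beta * rejected_logprob / rejected_len
    # The chosen reward must beat the rejected reward by at least gamma.
    return -log_sigmoid(chosen_reward - rejected_reward - gamma)


# Both responses average -1.0 log prob per token, so only gamma separates them.
loss_no_margin = simpo_loss(-4.0, -8.0, chosen_len=4, rejected_len=8, gamma=0.0)
loss_with_margin = simpo_loss(-4.0, -8.0, chosen_len=4, rejected_len=8, gamma=0.5)
print(loss_no_margin, loss_with_margin)
```

Without length normalization the longer rejected response would look worse simply because it sums over more tokens; dividing by length removes that bias.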
Usage
Execute this workflow after completing supervised fine-tuning (SFT) when you have a preference dataset containing chosen/rejected response pairs and want to align the model with human preferences. DPO is preferred over PPO-based RLHF when you want a simpler training pipeline that does not require a separate reward model.
Execution Steps
Step 1: Preference Data Preparation
Prepare the preference dataset containing paired chosen and rejected responses for each prompt. The data preparation scripts convert raw preference data into tokenized Arrow format with separate fields for chosen and rejected input_ids, attention_masks, and loss_masks.
Key considerations:
- Each example must have both chosen and rejected completions
- Apply the correct conversation template for the model family
- Loss masks control which tokens contribute to the preference loss
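The record layout described above can be sketched as follows. This is an illustration only: the whitespace tokenizer, the `build_preference_record` helper, and the exact field names are assumptions, and a real pipeline would use the model's tokenizer and conversation template.

```python
from typing import Dict, List


def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Hypothetical whitespace tokenizer that assigns ids on first sight."""
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]


def build_preference_record(
    prompt: str, chosen: str, rejected: str, vocab: Dict[str, int]
) -> Dict[str, List[int]]:
    """Build one example with separate chosen/rejected ids and masks."""
    prompt_ids = toy_tokenize(prompt, vocab)
    record: Dict[str, List[int]] = {}
    for side, completion in (("chosen", chosen), ("rejected", rejected)):
        completion_ids = toy_tokenize(completion, vocab)
        input_ids = prompt_ids + completion_ids
        record[f"{side}_input_ids"] = input_ids
        record[f"{side}_attention_mask"] = [1] * len(input_ids)
        # Loss mask: 0 on prompt tokens, 1 on completion tokens, so only
        # the response contributes to the preference loss.
        record[f"{side}_loss_mask"] = [0] * len(prompt_ids) + [1] * len(completion_ids)
    return record


vocab: Dict[str, int] = {}
record = build_preference_record(
    "What is 2 + 2 ?", "It is 4 .", "I do not know .", vocab
)
print(record["chosen_loss_mask"])
print(record["rejected_loss_mask"])
```

Both sides share the same prompt prefix, so the two sequences differ only in the tokens where the loss mask is 1.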
Step 2: Environment and Model Initialization
Initialize the distributed training environment and load both the actor model (to be trained) and the reference model (frozen copy). Both models are loaded from the same SFT checkpoint. Dropout is disabled on both models to ensure deterministic log probability computation.
What happens:
- Launch distributed training with colossalai.launch_from_torch()
- Load actor model and reference model from pretrained SFT checkpoint
- Disable dropout on both models for stable training
- Apply LoRA to actor model if parameter-efficient training is desired
Step 3: Plugin and Booster Configuration
Configure the parallelism strategy and create two Booster instances: one for the actor model (with optimizer) and one for the reference model (inference only). The reference model booster does not need optimizer support.
Key considerations:
- Both the actor and the reference model must use the same plugin type
- Reference model runs in inference mode with no gradient computation
- ZeRO-2 is the recommended default strategy for DPO training
Step 4: Tokenizer and Dataloader Setup
Configure the tokenizer with pad_token set to eos_token and prepare the preference dataloaders using DataCollatorForPreferenceDataset, which handles batching of chosen/rejected pairs.
What happens:
- Load tokenizer with right-side padding and no automatic BOS/EOS tokens
- Create training and evaluation dataloaders with preference data collator
- Apply distributed sampling via StatefulDistributedSampler
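The batching behavior can be sketched in plain Python. This is not the `DataCollatorForPreferenceDataset` implementation; `collate_preference_batch`, the field names, and the token ids are hypothetical. It shows right-side padding with the EOS id reused as the pad id, which is why the attention mask is needed to tell padding apart from a genuine EOS token.

```python
from typing import Dict, List


def pad_right(seqs: List[List[int]], pad_value: int) -> List[List[int]]:
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_value] * (max_len - len(s)) for s in seqs]


def collate_preference_batch(
    records: List[Dict[str, List[int]]], pad_token_id: int
) -> Dict[str, List[List[int]]]:
    """Pad chosen and rejected fields separately; masks are padded with 0."""
    batch: Dict[str, List[List[int]]] = {}
    for side in ("chosen", "rejected"):
        batch[f"{side}_input_ids"] = pad_right(
            [r[f"{side}_input_ids"] for r in records], pad_token_id
        )
        for mask in ("attention_mask", "loss_mask"):
            batch[f"{side}_{mask}"] = pad_right(
                [r[f"{side}_{mask}"] for r in records], 0
            )
    return batch


EOS_ID = 2  # assumed id; pad_token is set to eos_token
records = [
    {
        "chosen_input_ids": [5, 6, 7], "chosen_attention_mask": [1, 1, 1],
        "chosen_loss_mask": [0, 1, 1],
        "rejected_input_ids": [5, 8], "rejected_attention_mask": [1, 1],
        "rejected_loss_mask": [0, 1],
    },
    {
        "chosen_input_ids": [9, 2], "chosen_attention_mask": [1, 1],
        "chosen_loss_mask": [0, 1],
        "rejected_input_ids": [9, 3, 4, 2], "rejected_attention_mask": [1, 1, 1, 1],
        "rejected_loss_mask": [0, 1, 1, 1],
    },
]
batch = collate_preference_batch(records, pad_token_id=EOS_ID)
print(batch["chosen_input_ids"])
print(batch["chosen_attention_mask"])
```

In the second chosen sequence the real EOS and the padding share id 2; only the attention and loss masks distinguish them.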
Step 5: DPO Training Loop
Execute the DPO training loop through the DPOTrainer. Each training step computes log probabilities for both chosen and rejected responses under both the actor and reference models, then computes the DPO loss.
What happens per step:
- Forward reference model on concatenated chosen+rejected sequences (no gradients)
- Compute masked log probabilities for reference model outputs
- Forward actor model on same sequences (with gradients)
- Compute masked log probabilities for actor model outputs
- Calculate DPO loss from the four sets of log probabilities
- Backward pass and gradient accumulation
- Track chosen rewards, rejected rewards, accuracy, and margin metrics
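The loss and metric computation in the steps above can be sketched for a single example with the standard library. This follows the standard DPO formulation, loss = -log sigmoid(beta * ((actor_chosen - ref_chosen) - (actor_rejected - ref_rejected))). The helper names, the beta value, and the per-token log probabilities are illustrative assumptions; the trainer itself operates on batched tensors.

```python
import math
from typing import Dict, List


def masked_sequence_logprob(token_logprobs: List[float], loss_mask: List[int]) -> float:
    """Sum per-token log probs over positions where the loss mask is 1."""
    return sum(lp * m for lp, m in zip(token_logprobs, loss_mask))


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_metrics(
    actor_chosen: float, actor_rejected: float,
    ref_chosen: float, ref_rejected: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> Dict[str, float]:
    # Implicit rewards: how much the actor has moved away from the reference.
    chosen_reward = beta * (actor_chosen - ref_chosen)
    rejected_reward = beta * (actor_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    return {
        "loss": -log_sigmoid(margin),
        "chosen_reward": chosen_reward,
        "rejected_reward": rejected_reward,
        "margin": margin,
        # Counted as correct when the chosen response gets the higher reward.
        "accuracy": 1.0 if margin > 0 else 0.0,
    }


mask = [0, 1, 1]  # first token is prompt, last two are response
metrics = dpo_metrics(
    actor_chosen=masked_sequence_logprob([-0.5, -0.2, -0.3], mask),
    actor_rejected=masked_sequence_logprob([-0.5, -1.0, -1.5], mask),
    ref_chosen=masked_sequence_logprob([-0.9, -0.4, -0.6], mask),
    ref_rejected=masked_sequence_logprob([-0.7, -0.8, -1.2], mask),
)
print(metrics)
```

When the actor equals the reference, both rewards are zero and the loss is log 2; training lowers the loss by widening the margin between chosen and rejected rewards.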
Step 6: Evaluation and Checkpointing
Periodically evaluate on a held-out preference dataset and save model checkpoints at configurable intervals. Evaluation computes the same DPO metrics without gradient updates.
Key considerations:
- Evaluation runs after each epoch on the eval preference dataloader
- Checkpoints include full training state for resumability
- Metrics logged to TensorBoard and optionally Weights & Biases
Step 7: Model Saving
Save the final aligned model checkpoint. For LoRA training, adapter weights are merged into the base model before saving.
Key considerations:
- Final model is saved with sharded weights
- Only the actor model is saved (reference model is unchanged)