Workflow:Hpcaitech ColossalAI DPO Alignment

Knowledge Sources	ColossalAI, DPO Paper, ColossalChat Examples README
Domains	LLMs, RLHF, Alignment, Distributed_Training
Last Updated	2026-02-09 03:00 GMT

Overview

End-to-end process for aligning language models with human preference data using Direct Preference Optimization (DPO) on ColossalAI's distributed training framework.

Description

This workflow implements the Direct Preference Optimization alignment method, which directly optimizes a language model's policy using preference data without requiring a separate reward model. The process trains an actor model against a frozen reference model, using paired chosen/rejected responses to compute the DPO loss. The implementation also supports SimPO (Simple Preference Optimization) through two options: the gamma parameter, which sets SimPO's target reward margin, and length normalization, which averages log probabilities over the response tokens instead of summing them. Training supports multiple parallelism strategies through ColossalAI's plugin system, including ZeRO-2, Gemini, and 3D parallelism.
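For reference, the two objectives named above can be written out as published in their respective papers. Here $\pi_\theta$ is the actor, $\pi_{\text{ref}}$ the frozen reference model, $(x, y_w, y_l)$ a prompt with its chosen and rejected responses, $\beta$ a scaling coefficient, and $\gamma$ the SimPO target margin. How this implementation combines gamma and length normalization with the DPO loss is not spelled out on this page, so the second equation should be read as the published SimPO form, not as a description of the code.

```latex
\mathcal{L}_{\text{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[
      \log \sigma\!\left(
        \beta \left[
          \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
          - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
        \right]
      \right)
    \right]

\mathcal{L}_{\text{SimPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[
      \log \sigma\!\left(
        \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x)
        - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x)
        - \gamma
      \right)
    \right]
```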

Usage

Execute this workflow after completing supervised fine-tuning (SFT) when you have a preference dataset containing chosen/rejected response pairs and want to align the model with human preferences. DPO is preferred over PPO-based RLHF when you want a simpler training pipeline that does not require a separate reward model.

Execution Steps

Step 1: Preference Data Preparation

Prepare the preference dataset containing paired chosen and rejected responses for each prompt. The data preparation scripts convert raw preference data into tokenized Arrow format with separate fields for chosen and rejected input_ids, attention_masks, and loss_masks.

Key considerations:

Each example must have both chosen and rejected completions
Apply the correct conversation template for the model family
Loss masks control which tokens contribute to the preference loss
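The record layout described in this step can be sketched as follows. This is a minimal illustration, not the repository's data preparation script: the function name and the `chosen_*` / `rejected_*` key names are assumptions, and the inputs are already-tokenized ID lists, so no conversation template is applied here.

```python
from typing import Dict, List


def build_preference_record(
    prompt_ids: List[int],
    chosen_ids: List[int],
    rejected_ids: List[int],
) -> Dict[str, List[int]]:
    """Build one tokenized preference example (key names are hypothetical)."""
    if not chosen_ids or not rejected_ids:
        raise ValueError("each example needs both a chosen and a rejected completion")

    record: Dict[str, List[int]] = {}
    for side, response_ids in (("chosen", chosen_ids), ("rejected", rejected_ids)):
        input_ids = prompt_ids + response_ids
        record[f"{side}_input_ids"] = input_ids
        # No padding yet, so every position is a real token.
        record[f"{side}_attention_mask"] = [1] * len(input_ids)
        # 0 over the prompt, 1 over the response: only response tokens are scored.
        record[f"{side}_loss_mask"] = [0] * len(prompt_ids) + [1] * len(response_ids)
    return record


record = build_preference_record(prompt_ids=[11, 12, 13], chosen_ids=[21, 22], rejected_ids=[31])
print(record["chosen_loss_mask"])    # [0, 0, 0, 1, 1]
print(record["rejected_loss_mask"])  # [0, 0, 0, 1]
```

Because the prompt is shared by both sides, masking it out means the preference loss compares only the two completions.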

Step 2: Environment and Model Initialization

Initialize the distributed training environment and load both the actor model (to be trained) and the reference model (frozen copy). Both models are loaded from the same SFT checkpoint. Dropout is disabled on both models to ensure deterministic log probability computation.

What happens:

Launch distributed training with colossalai.launch_from_torch()
Load actor model and reference model from pretrained SFT checkpoint
Disable dropout on both models for stable training
Apply LoRA to actor model if parameter-efficient training is desired
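The dropout and freezing steps above might look like the sketch below in plain PyTorch. Both helper names are assumptions made for illustration; the repository may ship its own utilities. Note that this only touches `torch.nn.Dropout` modules, so a model that applies dropout through other means would need separate handling.

```python
def disable_dropout(model) -> int:
    """Set p = 0 on every torch.nn.Dropout submodule; return how many were changed."""
    import torch.nn as nn  # imported here so the helper can be defined without PyTorch loaded

    changed = 0
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = 0.0
            changed += 1
    return changed


def freeze(model) -> None:
    """Put a model in eval mode and stop gradient tracking (intended for the reference model)."""
    model.eval()
    for param in model.parameters():
        param.requires_grad_(False)
```

In this workflow, `disable_dropout` would be applied to both the actor and the reference model after loading them from the SFT checkpoint, and `freeze` to the reference model only.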

Step 3: Plugin and Booster Configuration

Configure the parallelism strategy and create two Booster instances: one for the actor model, which also wraps the optimizer, and one for the reference model, which is used for inference only and therefore needs no optimizer.

Key considerations:

Both actor and reference model must use the same plugin type
Reference model runs in inference mode with no gradient computation
ZeRO-2 is the recommended default strategy for DPO training
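A possible way to wire this up is sketched below. The short plugin names (`zero2`, `gemini`, `3d`) and both helper functions are assumptions made for this example; the plugin classes and `Booster` are ColossalAI's own. The arguments shown are a small illustrative subset, and the plugins can only be constructed inside an initialized distributed environment.

```python
SUPPORTED_PLUGINS = ("zero2", "gemini", "3d")


def build_plugin(name: str, precision: str = "bf16", tp_size: int = 1):
    """Return a ColossalAI plugin for a (hypothetical) short name."""
    if name not in SUPPORTED_PLUGINS:
        raise ValueError(f"unknown plugin {name!r}; expected one of {SUPPORTED_PLUGINS}")

    # Imported lazily so that a bad name fails fast, before the heavy import.
    from colossalai.booster.plugin import GeminiPlugin, HybridParallelPlugin, LowLevelZeroPlugin

    if name == "zero2":
        return LowLevelZeroPlugin(stage=2, precision=precision)
    if name == "gemini":
        return GeminiPlugin(precision=precision)
    # "3d": tensor parallelism only in this sketch; pipeline stages are left at 1.
    return HybridParallelPlugin(tp_size=tp_size, pp_size=1, precision=precision)


def build_boosters(name: str):
    """Create one booster for the actor and one for the reference model."""
    from colossalai.booster import Booster

    # Same plugin type for both models, but separate plugin instances.
    return Booster(plugin=build_plugin(name)), Booster(plugin=build_plugin(name))
```

With these in hand, the actor would be passed to `boost` together with its optimizer, while the reference model would be passed on its own, since the optimizer argument of `Booster.boost` is optional.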

Step 4: Tokenizer and Dataloader Setup

Configure the tokenizer with pad_token set to eos_token and prepare the preference dataloaders using DataCollatorForPreferenceDataset, which handles batching of chosen/rejected pairs.

What happens:

Load tokenizer with right-side padding and no automatic BOS/EOS tokens
Create training and evaluation dataloaders with preference data collator
Apply distributed sampling via StatefulDistributedSampler
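To illustrate what batching chosen/rejected pairs involves, here is a minimal pure-Python collate sketch. It is not the `DataCollatorForPreferenceDataset` implementation; the function names, the key names, and the pad ID of 2 are assumptions for the example.

```python
from typing import Dict, List

PAD_ID = 2  # assumption: pad_token_id == eos_token_id == 2, as when pad_token is set to eos_token


def pad_right(seqs: List[List[int]], value: int) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [s + [value] * (width - len(s)) for s in seqs]


def collate_preference_batch(records: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
    """Batch preference records; padded positions get attention_mask 0 and loss_mask 0."""
    batch: Dict[str, List[List[int]]] = {}
    for side in ("chosen", "rejected"):
        batch[f"{side}_input_ids"] = pad_right([r[f"{side}_input_ids"] for r in records], PAD_ID)
        for mask in ("attention_mask", "loss_mask"):
            key = f"{side}_{mask}"
            batch[key] = pad_right([r[key] for r in records], 0)
    return batch


records = [
    {"chosen_input_ids": [5, 6, 7], "chosen_attention_mask": [1, 1, 1], "chosen_loss_mask": [0, 1, 1],
     "rejected_input_ids": [5, 8], "rejected_attention_mask": [1, 1], "rejected_loss_mask": [0, 1]},
    {"chosen_input_ids": [9, 4], "chosen_attention_mask": [1, 1], "chosen_loss_mask": [0, 1],
     "rejected_input_ids": [9, 3, 3, 3], "rejected_attention_mask": [1, 1, 1, 1],
     "rejected_loss_mask": [0, 1, 1, 1]},
]
batch = collate_preference_batch(records)
print(batch["chosen_input_ids"])    # [[5, 6, 7], [9, 4, 2]]
print(batch["rejected_loss_mask"])  # [[0, 1, 0, 0], [0, 1, 1, 1]]
```

Because the pad token shares its ID with EOS, the masks are carried explicitly. Deriving them from `input_ids != PAD_ID` would also mask out genuine end-of-sequence tokens.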

Step 5: DPO Training Loop

Execute the DPO training loop through the DPOTrainer. Each training step computes log probabilities for both chosen and rejected responses under both the actor and reference models, then computes the DPO loss.

What happens per step:

Forward reference model on concatenated chosen+rejected sequences (no gradients)
Compute masked log probabilities for reference model outputs
Forward actor model on same sequences (with gradients)
Compute masked log probabilities for actor model outputs
Calculate DPO loss from the four sets of log probabilities (actor and reference, each on chosen and rejected)
Backward pass and gradient accumulation
Track chosen rewards, rejected rewards, accuracy, and margin metrics
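The arithmetic behind the loss and the tracked metrics can be sketched for a single preference pair using only the standard library. This follows the standard DPO formulation, not the DPOTrainer code itself: the function names and the default `beta` of 0.1 are assumptions, and real training operates on batched tensors.

```python
import math
from typing import Dict, List


def masked_logprob_sum(token_logprobs: List[float], loss_mask: List[int]) -> float:
    """Sum per-token log probabilities over positions where loss_mask is 1."""
    return sum(lp for lp, m in zip(token_logprobs, loss_mask) if m)


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dpo_step_metrics(
    actor_chosen: float,
    actor_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,
) -> Dict[str, float]:
    """Standard DPO loss and reward metrics from four sequence-level log probabilities."""
    chosen_reward = beta * (actor_chosen - ref_chosen)
    rejected_reward = beta * (actor_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    return {
        "loss": -log_sigmoid(margin),
        "chosen_reward": chosen_reward,
        "rejected_reward": rejected_reward,
        "accuracy": 1.0 if margin > 0 else 0.0,  # did the actor rank chosen above rejected?
        "margin": margin,
    }


metrics = dpo_step_metrics(
    actor_chosen=masked_logprob_sum([-0.5, -1.0, -0.2], [0, 1, 1]),
    actor_rejected=-3.0,
    ref_chosen=-2.0,
    ref_rejected=-2.5,
)
print(metrics)
```

When the actor equals the reference, the margin is zero and the loss is log 2. Training lowers the loss by raising the chosen reward relative to the rejected one.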

Step 6: Evaluation and Checkpointing

Evaluate on a held-out preference dataset after each epoch, and save model checkpoints at configurable intervals. Evaluation computes the same DPO metrics as training, without gradient updates.

Key considerations:

Evaluation runs after each epoch on the eval preference dataloader
Checkpoints include full training state for resumability
Metrics logged to TensorBoard and optionally Weights & Biases
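Resumable checkpoints need a small amount of bookkeeping alongside the model, optimizer, and scheduler state (which ColossalAI's `Booster` can write through `save_model`, `save_optimizer`, and `save_lr_scheduler`). The sketch below covers only that bookkeeping. The file name, the key names, and the `should_save` helper are assumptions for illustration.

```python
import json
import os
import tempfile
from typing import Dict


def save_running_states(ckpt_dir: str, epoch: int, step: int, batch_size: int) -> str:
    """Write the counters needed to resume mid-epoch (file and key names are hypothetical)."""
    os.makedirs(ckpt_dir, exist_ok=True)
    states = {"epoch": epoch, "step": step, "sample_start_index": step * batch_size}
    path = os.path.join(ckpt_dir, "running_states.json")
    with open(path, "w") as f:
        json.dump(states, f, indent=2)
    return path


def load_running_states(ckpt_dir: str) -> Dict[str, int]:
    """Read back the counters written by save_running_states."""
    with open(os.path.join(ckpt_dir, "running_states.json")) as f:
        return json.load(f)


def should_save(step: int, save_interval: int) -> bool:
    """True on every save_interval-th step (steps counted from 1); never if the interval is 0."""
    return save_interval > 0 and step % save_interval == 0


with tempfile.TemporaryDirectory() as tmp:
    save_running_states(tmp, epoch=1, step=200, batch_size=8)
    resumed = load_running_states(tmp)
print(resumed["sample_start_index"])  # 1600
```

On resume, the sample start index lets a stateful sampler skip the examples already consumed in the interrupted epoch.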

Step 7: Model Saving

Save the final aligned model checkpoint. For LoRA training, adapter weights are merged into the base model before saving.

Key considerations:

Final model is saved with sharded weights
Only the actor model is saved; the reference model is never updated, so it remains identical to the SFT checkpoint
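Merging LoRA adapters means folding the low-rank update into the base weight, so the saved model no longer needs the adapter at inference time. The sketch below shows the standard LoRA merge on nested Python lists. It illustrates the arithmetic only; how this repository performs the merge is not shown on this page, and the function names are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A, the standard LoRA merge."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]


# 2x2 base weight with a rank-1 adapter: B is 2x1, A is 1x2.
merged = merge_lora(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_b=[[1.0], [2.0]],
    lora_a=[[0.5, 0.0]],
    alpha=2.0,
    rank=1,
)
print(merged)  # [[2.0, 0.0], [2.0, 1.0]]
```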
