Workflow:LLMBook zh LLMBook zh github io DPO Alignment

Knowledge Sources	LLMBook-zh, TRL Library, Direct Preference Optimization
Domains	LLMs, Alignment, RLHF
Last Updated	2026-02-08 04:30 GMT

Overview

End-to-end human-alignment workflow that trains a reward model and applies Direct Preference Optimization (DPO) to align a language model with human preferences without a reinforcement learning loop.

Description

This workflow covers the alignment stage of LLM development, where a supervised fine-tuned model is further trained to produce outputs that align with human preferences. Two complementary approaches are implemented: (1) a reward model that learns to score outputs based on human preference comparisons, and (2) Direct Preference Optimization (DPO), which directly optimizes the language model policy using preference pairs without requiring a separate reward model or RL training loop. The reward model uses a contrastive loss on chosen vs. rejected response pairs plus a language modeling regularization term. DPO reformulates the RLHF objective as a simple classification loss on preference pairs, using a reference model to prevent the policy from diverging too far from the original behavior.

Usage

Execute this workflow after supervised fine-tuning when you have a preference dataset containing chosen (preferred) and rejected (dispreferred) response pairs for the same prompts. This is the standard approach for making a model's outputs more helpful, harmless, and honest according to human judgments. DPO is preferred over PPO-based RLHF when you want simpler training with fewer hyperparameters and do not need a separate reward model at training time.

Execution Steps

Step 1: Reward Model Architecture

Design a reward model by extending the base language model with a scalar reward head. The reward model inherits the full LLM architecture and adds a linear projection layer that maps the final hidden state to a single scalar reward value. The model computes rewards for both chosen and rejected responses, then optimizes a contrastive loss (binary cross-entropy on the reward difference) that encourages higher rewards for preferred responses. A language modeling loss on the chosen response serves as a regularization term to prevent the reward model from forgetting language understanding.

Key considerations:

The reward head is a single Linear(hidden_size, 1) layer with no bias
The contrastive loss operates on reward differences: rm_loss = -log(sigmoid(reward_chosen - reward_rejected)), which is binary cross-entropy on the difference with a target of 1
The LM regularization loss uses standard cross-entropy on the chosen response
Final loss combines both terms: loss = rm_loss + lm_loss
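
The reward head and combined loss described above can be sketched in PyTorch as follows. This is a minimal illustration, not the workflow's actual implementation: the names RewardHead and reward_model_loss are hypothetical, the hidden state passed to the head is assumed to be the final hidden state of each sequence, and the one-position label shift and the -100 ignore index follow common causal-LM practice rather than anything stated in the source.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardHead(nn.Module):
    """Hypothetical scalar reward head: Linear(hidden_size, 1) with no bias."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, final_hidden: torch.Tensor) -> torch.Tensor:
        # final_hidden: (batch, hidden_size) -> rewards: (batch,)
        return self.proj(final_hidden).squeeze(-1)


def reward_model_loss(reward_chosen, reward_rejected, lm_logits, lm_labels):
    """Contrastive loss on the reward difference plus LM regularization."""
    diff = reward_chosen - reward_rejected
    # BCE with target 1 on the difference equals -log(sigmoid(diff)).
    rm_loss = F.binary_cross_entropy_with_logits(diff, torch.ones_like(diff))
    # Next-token cross-entropy on the chosen response (assumed shift by one).
    vocab_size = lm_logits.size(-1)
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, vocab_size),
        lm_labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    return rm_loss + lm_loss
```

The two terms are summed with equal weight, matching the final loss stated above; a weighting coefficient on lm_loss would be a separate design choice not described here.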

Step 2: Preference Data Preparation

Process the preference dataset into the format required by the DPO trainer. Each example in the dataset contains a prompt with both a chosen and a rejected completion. The data processing extracts the prompt by finding the last assistant turn marker, then separates the chosen and rejected responses. The result is a dataset with three fields per example: prompt, chosen response, and rejected response.

Key considerations:

The data format follows the Anthropic HH-RLHF structure with "Human:" and "Assistant:" markers
The prompt is extracted by finding the last occurrence of the assistant turn delimiter
Both chosen and rejected responses are stripped of the shared prompt prefix
The dataset is loaded using HuggingFace datasets library for efficient processing
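
The prompt extraction can be illustrated with a small pure-Python function. This is a sketch under stated assumptions: the function name split_prompt_and_responses and the exact delimiter string "\n\nAssistant:" are assumptions based on the HH-RLHF layout, and the input is assumed to be a dict with full-dialogue "chosen" and "rejected" strings.

```python
ASSISTANT_MARKER = "\n\nAssistant:"  # assumed HH-RLHF turn delimiter


def split_prompt_and_responses(example: dict) -> dict:
    """Split one preference example into prompt, chosen, and rejected."""
    chosen, rejected = example["chosen"], example["rejected"]
    idx = chosen.rfind(ASSISTANT_MARKER)  # last assistant turn
    if idx == -1:
        raise ValueError("no assistant marker found in chosen dialogue")
    prompt_end = idx + len(ASSISTANT_MARKER)
    prompt = chosen[:prompt_end]
    if not rejected.startswith(prompt):
        raise ValueError("chosen and rejected do not share the same prompt")
    return {
        "prompt": prompt,
        "chosen": chosen[prompt_end:],      # response only
        "rejected": rejected[prompt_end:],  # response only
    }


example = {
    "chosen": "\n\nHuman: Hi\n\nAssistant: Hello!",
    "rejected": "\n\nHuman: Hi\n\nAssistant: Go away.",
}
record = split_prompt_and_responses(example)
```

With the HuggingFace datasets library, a function of this shape would typically be applied to every example through the dataset's map method.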

Step 3: Model and Reference Model Loading

Load two copies of the supervised fine-tuned model: the policy model (which will be trained) and the reference model (which remains frozen). The reference model serves as a baseline to prevent the policy from diverging too far during DPO optimization. The reference model is set to evaluation mode with all gradients disabled. Both models share the same architecture and initial weights.

Key considerations:

The reference model must be an exact copy of the initial policy model
All reference model parameters are frozen (requires_grad = False)
The reference model is set to eval mode to disable dropout
Both models use the same tokenizer with EOS token enabled and right-side padding
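
Freezing the reference model can be sketched as below. The helper name make_frozen_reference is hypothetical, and a tiny nn.Sequential stands in for the supervised fine-tuned model. The sketch copies the policy in memory with copy.deepcopy for brevity, whereas the workflow describes loading two copies of the model; either way, the reference starts from the same weights as the policy.

```python
import copy

import torch
import torch.nn as nn


def make_frozen_reference(policy: nn.Module) -> nn.Module:
    """Return a frozen, eval-mode copy of the policy to use as reference."""
    reference = copy.deepcopy(policy)  # identical architecture and weights
    reference.eval()                   # disables dropout
    for param in reference.parameters():
        param.requires_grad_(False)    # no gradients for the reference
    return reference


# Stand-in for the SFT model (assumption for illustration only).
policy = nn.Sequential(nn.Linear(4, 4), nn.Dropout(0.1), nn.Linear(4, 2))
reference = make_frozen_reference(policy)
```

The policy keeps its trainable parameters and training mode; only the copy is frozen.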

Step 4: DPO Training

Train the policy model using the DPO objective implemented by TRL's DPOTrainer. The DPO loss computes log-probability ratios between the policy and reference models for both chosen and rejected responses, then optimizes a classification loss that increases the probability of chosen responses relative to rejected ones. The beta hyperparameter controls the strength of the implicit KL divergence constraint: a higher beta keeps the policy closer to the reference model.

Key considerations:

The beta parameter (default 0.1) controls the tradeoff between reward maximization and staying close to the reference policy
The DPOTrainer handles log-probability computation, loss calculation, and optimization automatically
BF16 mixed precision is used for memory efficiency
The trained model and state are saved after training completes
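
For intuition, the per-example DPO loss that the trainer computes internally can be written out in plain Python. This is a worked sketch of the standard DPO formula, not TRL's code: the function name dpo_loss is hypothetical, and the inputs are assumed to be summed log-probabilities of each full response under the policy and the reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable form of log(1 + exp(-logits)).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Policy identical to the reference: both log-ratios are 0, so loss = log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has raised the chosen and lowered the rejected log-probability.
loss_after_update = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

At initialization the loss equals log(2) regardless of beta, and it falls as the policy separates chosen from rejected responses more than the reference does.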
