Implementation:LLMBook_zh_LLMBook_zh_github_io_DPOTrainer_Train
| Knowledge Sources | Details |
|---|---|
| Domains | NLP, Alignment, Optimization |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for Direct Preference Optimization (DPO) training, using the DPOTrainer class from the Hugging Face TRL library.
Description
DPOTrainer from the TRL library implements the DPO loss function and training loop. In this repository, it is configured with a trainable policy model, a frozen reference model, a tokenizer, and a preference dataset with prompt/chosen/rejected columns. The beta parameter (default 0.1) controls the strength of the KL-divergence constraint that keeps the policy close to the reference model.
This is a Wrapper Doc documenting how the LLMBook repository uses the external TRL library for DPO training.
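The loss that DPOTrainer optimizes can be sketched as follows. This is a hypothetical standalone function, not TRL's internal code; it assumes the summed log-probabilities of each completion under both models have already been computed.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (illustrative sketch, not TRL internals).

    Each argument is the summed log-probability of the chosen or
    rejected completion under the policy or frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logits)): a larger margin in favor of the chosen
    # completion gives a smaller loss.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

With identical policy and reference log-probabilities the margin is zero and the loss is log 2; as the policy favors the chosen completion more than the reference does, the loss decreases.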
Usage
Use this after loading the policy and reference models and preparing the preference dataset.
Code Reference
Source Location
- Repository: LLMBook-zh
- File: code/8.2 DPO实践.py
- Lines: 75-84
Signature
dpo_trainer = DPOTrainer(
model=model, # Trainable policy model
ref_model=model_ref, # Frozen reference model
args=args, # Arguments with beta=0.1
tokenizer=tokenizer,
train_dataset=train_dataset, # Dataset with prompt/chosen/rejected
)
dpo_trainer.train()
dpo_trainer.save_state()
Import
from trl import DPOTrainer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | PreTrainedModel | Yes | Trainable policy model |
| ref_model | PreTrainedModel | Yes | Frozen reference model |
| args | TrainingArguments | Yes | Training args with beta parameter |
| tokenizer | AutoTokenizer | Yes | Tokenizer (with add_eos_token=True) |
| train_dataset | Dataset | Yes | Preference data with prompt/chosen/rejected columns |
Outputs
| Name | Type | Description |
|---|---|---|
| train() returns | TrainOutput | Training metrics |
| trainer state | Files | Saved via save_state() |
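To make the train_dataset contract concrete, a single record in the expected prompt/chosen/rejected format might look like the following. The strings are illustrative only; the validation helper is a hypothetical addition, not part of the repository.

```python
# Hypothetical single record in the prompt/chosen/rejected format
# that DPOTrainer expects (illustrative strings, hh-rlhf style).
record = {
    "prompt": "\n\nHuman: How do I sort a list in Python?\n\nAssistant:",
    "chosen": " Use the built-in sorted() function.",
    "rejected": " I am not sure.",
}

def is_valid_preference_record(rec):
    """Check that a record carries the three required string columns."""
    required = ("prompt", "chosen", "rejected")
    return all(isinstance(rec.get(key), str) for key in required)
```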
Usage Examples
from trl import DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Policy model (trainable) and reference model (a frozen copy)
model = AutoModelForCausalLM.from_pretrained("yulan-team/YuLan-Chat-12B-v3")
model_ref = AutoModelForCausalLM.from_pretrained("yulan-team/YuLan-Chat-12B-v3")
model_ref.eval()
for param in model_ref.parameters():
    param.requires_grad = False

tokenizer = AutoTokenizer.from_pretrained(
    "yulan-team/YuLan-Chat-12B-v3",
    model_max_length=512,
    padding_side="right",
    add_eos_token=True,
)

# get_data is the repository's preprocessing helper; args holds the
# training configuration (with beta=0.1) defined elsewhere in the script.
train_dataset = get_data("train", "Anthropic/hh-rlhf")

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=model_ref,
    args=args,  # beta=0.1
    tokenizer=tokenizer,
    train_dataset=train_dataset,
)
dpo_trainer.train()
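The repository's get_data helper is not reproduced here. As a rough illustration of what such preprocessing must do for Anthropic/hh-rlhf, the following hypothetical function splits one raw pair (where chosen and rejected are full dialogues sharing a common prefix) into the prompt/chosen/rejected columns; it is a sketch, not the repository's actual implementation.

```python
def split_hh_example(chosen, rejected):
    """Split one Anthropic/hh-rlhf pair into prompt/chosen/rejected.

    Hypothetical helper: the prompt is taken to be the shared dialogue
    up to and including the final "Assistant:" turn marker.
    """
    marker = "\n\nAssistant:"
    idx = chosen.rfind(marker) + len(marker)
    return {
        "prompt": chosen[:idx],
        "chosen": chosen[idx:],
        "rejected": rejected[rejected.rfind(marker) + len(marker):],
    }
```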
Related Pages
Implements Principle
Requires Environment
- Environment:LLMBook_zh_LLMBook_zh_github_io_PyTorch_CUDA_GPU_Environment
- Environment:LLMBook_zh_LLMBook_zh_github_io_HuggingFace_Transformers_Stack