Implementation:LLMBook_zh_LLMBook_zh_github_io_DPOTrainer_Train
| Knowledge Sources | Details |
|---|---|
| Domains | NLP, Alignment, Optimization |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for Direct Preference Optimization (DPO) training, using the DPOTrainer class from the Hugging Face TRL library.
Description
DPOTrainer from the TRL library implements the DPO loss function and training loop. In this repository, it is configured with a trainable policy model, a frozen reference model, a tokenizer, and a preference dataset with prompt/chosen/rejected columns. The beta parameter (default 0.1) controls the strength of the KL-divergence constraint that keeps the policy close to the reference model.
This is a Wrapper Doc documenting how the LLMBook repository uses the external TRL library for DPO training.
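The loss that DPOTrainer optimizes can be sketched as follows. This is a hypothetical standalone function, not TRL's internal code; it assumes the summed log-probabilities of each completion under both models have already been computed.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (illustrative sketch, not TRL internals).

    Each argument is the summed log-probability of the chosen or
    rejected completion under the policy or frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logits)): a larger margin in favor of the chosen
    # completion gives a smaller loss.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

With identical policy and reference log-probabilities the margin is zero and the loss is log 2; as the policy favors the chosen completion more than the reference does, the loss decreases.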
Usage
Use this after loading the policy and reference models and preparing the preference dataset.
Code Reference
Source Location
- Repository: LLMBook-zh
- File: code/8.2 DPO实践.py
- Lines: 75-84
Signature
dpo_trainer = DPOTrainer(
model=model, # Trainable policy model
ref_model=model_ref, # Frozen reference model
args=args, # Arguments with beta=0.1
tokenizer=tokenizer,
train_dataset=train_dataset, # Dataset with prompt/chosen/rejected
)
dpo_trainer.train()
dpo_trainer.save_state()
Import
from trl import DPOTrainer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | PreTrainedModel | Yes | Trainable policy model |
| ref_model | PreTrainedModel | Yes | Frozen reference model |
| args | TrainingArguments | Yes | Training args with beta parameter |
| tokenizer | AutoTokenizer | Yes | Tokenizer (with add_eos_token=True) |
| train_dataset | Dataset | Yes | Preference data with prompt/chosen/rejected columns |
Outputs
| Name | Type | Description |
|---|---|---|
| train() returns | TrainOutput | Training metrics |
| trainer state | Files | Saved via save_state() |
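To make the train_dataset contract concrete, a single record in the expected prompt/chosen/rejected format might look like the following. The strings are illustrative only; the validation helper is a hypothetical addition, not part of the repository.

```python
# Hypothetical single record in the prompt/chosen/rejected format
# that DPOTrainer expects (illustrative strings, hh-rlhf style).
record = {
    "prompt": "\n\nHuman: How do I sort a list in Python?\n\nAssistant:",
    "chosen": " Use the built-in sorted() function.",
    "rejected": " I am not sure.",
}

def is_valid_preference_record(rec):
    """Check that a record carries the three required string columns."""
    required = ("prompt", "chosen", "rejected")
    return all(isinstance(rec.get(key), str) for key in required)
```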
Usage Examples
from trl import DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Policy model (trainable) and reference model (a frozen copy)
model = AutoModelForCausalLM.from_pretrained("yulan-team/YuLan-Chat-12B-v3")
model_ref = AutoModelForCausalLM.from_pretrained("yulan-team/YuLan-Chat-12B-v3")
model_ref.eval()
for param in model_ref.parameters():
    param.requires_grad = False

tokenizer = AutoTokenizer.from_pretrained(
    "yulan-team/YuLan-Chat-12B-v3",
    model_max_length=512,
    padding_side="right",
    add_eos_token=True,
)

# get_data is the repository's preprocessing helper; args holds the
# training configuration (with beta=0.1) defined elsewhere in the script.
train_dataset = get_data("train", "Anthropic/hh-rlhf")

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=model_ref,
    args=args,  # beta=0.1
    tokenizer=tokenizer,
    train_dataset=train_dataset,
)
dpo_trainer.train()
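The repository's get_data helper is not reproduced here. As a rough illustration of what such preprocessing must do for Anthropic/hh-rlhf, the following hypothetical function splits one raw pair (where chosen and rejected are full dialogues sharing a common prefix) into the prompt/chosen/rejected columns; it is a sketch, not the repository's actual implementation.

```python
def split_hh_example(chosen, rejected):
    """Split one Anthropic/hh-rlhf pair into prompt/chosen/rejected.

    Hypothetical helper: the prompt is taken to be the shared dialogue
    up to and including the final "Assistant:" turn marker.
    """
    marker = "\n\nAssistant:"
    idx = chosen.rfind(marker) + len(marker)
    return {
        "prompt": chosen[:idx],
        "chosen": chosen[idx:],
        "rejected": rejected[rejected.rfind(marker) + len(marker):],
    }
```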
Related Pages
Implements Principle
Requires Environment
- Environment:LLMBook_zh_LLMBook_zh_github_io_PyTorch_CUDA_GPU_Environment
- Environment:LLMBook_zh_LLMBook_zh_github_io_HuggingFace_Transformers_Stack