
Implementation: LLMBook-zh DPOTrainer Train

From Leeroopedia


Knowledge Sources

  • Domains: NLP, Alignment, Optimization
  • Last Updated: 2026-02-08 00:00 GMT

Overview

A concrete tool page for Direct Preference Optimization (DPO) training with DPOTrainer from the HuggingFace TRL library.

Description

DPOTrainer from the TRL library implements the DPO loss function and training loop. In this repository it is configured with a trainable policy model, a frozen reference model, a tokenizer, and a preference dataset with prompt/chosen/rejected columns. The beta parameter (default 0.1) sets the strength of the implicit KL-divergence constraint that keeps the policy close to the reference model.

This is a Wrapper Doc documenting how the LLMBook repository uses the external TRL library for DPO training.
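
The per-example objective that DPOTrainer minimizes can be sketched in plain Python. This is a minimal illustration of the DPO loss formula, not TRL's batched implementation, and the log-probability values in the comments are hypothetical:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss:
    -log sigmoid(beta * (chosen_logratio - rejected_logratio)),
    where each log-ratio is policy logp minus reference logp."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When the policy equals the reference, the loss is log(2) ~= 0.693;
# it drops below log(2) as the policy shifts mass toward the chosen
# response relative to the reference.
example = dpo_loss(-5.0, -12.0, -10.0, -10.0)
```

A larger beta scales the log-ratio margin, so the same policy/reference gap moves the loss further from log(2): the policy is rewarded or penalized more sharply for deviating from the reference.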

Usage

Use this after loading the policy and reference models and preparing the preference dataset.
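
A rough sketch of the preceding arguments step, hedged: the exact fields depend on your TRL version (recent releases put beta on DPOConfig and rename tokenizer= to processing_class=, while older releases accepted beta= directly on DPOTrainer). The output directory and optimizer settings below are illustrative, not the repository's values; only beta=0.1 matches the documented default:

```python
from trl import DPOConfig

# Hypothetical settings; adjust to your hardware and TRL version.
args = DPOConfig(
    output_dir="./dpo_output",       # hypothetical path
    beta=0.1,                        # strength of the implicit KL constraint
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    num_train_epochs=1,
)
```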

Code Reference

Source Location

  • Repository: LLMBook-zh
  • File: code/8.2 DPO实践.py
  • Lines: 75-84

Signature

dpo_trainer = DPOTrainer(
    model=model,              # Trainable policy model
    ref_model=model_ref,      # Frozen reference model
    args=args,                # Arguments with beta=0.1
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # Dataset with prompt/chosen/rejected
)
dpo_trainer.train()
dpo_trainer.save_state()

Import

from trl import DPOTrainer

I/O Contract

Inputs

Name           Type               Required  Description
model          PreTrainedModel    Yes       Trainable policy model
ref_model      PreTrainedModel    Yes       Frozen reference model
args           TrainingArguments  Yes       Training args with beta parameter
tokenizer      AutoTokenizer      Yes       Tokenizer (with add_eos_token=True)
train_dataset  Dataset            Yes       Preference data with prompt/chosen/rejected columns
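
To make the train_dataset contract concrete, here is a minimal sketch of the expected column layout. The rows are invented examples; in practice you would wrap such a dict with datasets.Dataset.from_dict before passing it to DPOTrainer:

```python
# Hypothetical rows showing the three columns DPOTrainer consumes.
preference_data = {
    "prompt":   ["\n\nHuman: What is DPO?\n\nAssistant:"],
    "chosen":   [" It trains the policy directly on preference pairs."],
    "rejected": [" I am not sure."],
}

# Every column must be present, and all columns must be row-aligned.
columns = set(preference_data)
lengths = {len(v) for v in preference_data.values()}
```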

Outputs

Name             Type         Description
train() returns  TrainOutput  Training metrics
trainer state    Files        Saved to disk via save_state()

Usage Examples

from trl import DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("yulan-team/YuLan-Chat-12B-v3")
model_ref = AutoModelForCausalLM.from_pretrained("yulan-team/YuLan-Chat-12B-v3")
model_ref.eval()
for param in model_ref.parameters():
    param.requires_grad = False

tokenizer = AutoTokenizer.from_pretrained(
    "yulan-team/YuLan-Chat-12B-v3",
    model_max_length=512,
    padding_side="right",
    add_eos_token=True,
)

# get_data() is a helper defined in the repository script
# (code/8.2 DPO实践.py), not part of TRL; it maps raw
# Anthropic/hh-rlhf records to prompt/chosen/rejected columns.
train_dataset = get_data("train", "Anthropic/hh-rlhf")

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=model_ref,
    args=args,  # TrainingArguments built earlier in the script, with beta=0.1
    tokenizer=tokenizer,
    train_dataset=train_dataset,
)
dpo_trainer.train()

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
