Implementation:Intel Ipex llm DPOTrainer Usage

Knowledge Sources	IPEX-LLM TRL Documentation
Domains	NLP, RLHF, Training
Last Updated	2026-02-09 00:00 GMT

Overview

TRL's DPOTrainer, configured for preference-optimization (DPO) training on Intel XPU with IPEX-LLM.

Description

This page documents TRL's DPOTrainer and DPOConfig as they are used for IPEX-LLM DPO training on Intel XPU. DPOConfig extends TrainingArguments with DPO-specific parameters (beta, max_prompt_length, max_length). DPOTrainer computes the DPO loss from the log-probabilities that the policy model and the frozen reference model assign to each chosen and rejected response. The key Intel XPU adaptations are bf16=True and optim="adamw_hf", because paged_adamw_32bit is not supported on XPU.
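To make the role of the policy model, the reference model, and beta concrete, the sketch below computes the standard sigmoid DPO loss for a single preference pair in plain Python. It is an illustration only, not TRL's implementation: the function name `dpo_loss` and the scalar log-probability inputs are assumptions, and TRL operates on batched tensors and supports other loss variants.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair (illustrative sketch)."""
    # How strongly the policy prefers "chosen" over "rejected" ...
    policy_margin = policy_chosen_logp - policy_rejected_logp
    # ... compared with how strongly the frozen reference model does.
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) evaluated as softplus(-x) in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy favours the chosen response more than the reference does: lower loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

A larger beta scales the margin difference, so the same log-probability gap moves the loss further from log(2) in either direction.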

External Reference

TRL DPOTrainer Documentation

Usage

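Use DPOTrainer after loading both the policy and reference models and formatting the dataset into prompt, chosen, and rejected fields. Configure DPOConfig with Intel XPU-compatible settings (bf16=True, optim="adamw_hf").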
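As a rough sketch of the dataset formatting step, the helper below maps one raw record to the prompt / chosen / rejected layout. The input keys (`question`, `good_answer`, `bad_answer`) and the function name `to_preference_record` are hypothetical; a real dataset will use its own column names and usually a model-specific chat template.

```python
def to_preference_record(example):
    """Map one raw record to the prompt/chosen/rejected layout.

    The input keys used here are hypothetical placeholders.
    """
    return {
        "prompt": example["question"].strip() + "\n",
        "chosen": example["good_answer"].strip(),
        "rejected": example["bad_answer"].strip(),
    }


# Tiny in-memory example standing in for a real preference dataset.
raw = [
    {
        "question": "What is the capital of France? ",
        "good_answer": "Paris",
        "bad_answer": "Lyon",
    },
]
records = [to_preference_record(r) for r in raw]
print(records[0])
```

With a Hugging Face `Dataset`, the same function could be applied through `Dataset.map`.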

Code Reference

Source Location

Repository: IPEX-LLM
File: python/llm/example/GPU/LLM-Finetuning/DPO/dpo_finetuning.py
Lines: 146-178

Signature

from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    per_device_train_batch_size=4,   # int
    gradient_accumulation_steps=4,   # int
    gradient_checkpointing=False,    # bool
    learning_rate=5e-5,              # float
    lr_scheduler_type="cosine",      # str
    beta=0.1,                        # float, DPO-specific: KL penalty coefficient
    max_prompt_length=1024,          # int, DPO-specific: max prompt tokens
    max_length=1536,                 # int, DPO-specific: max total tokens
    max_steps=200,                   # int
    save_strategy="no",              # str, no automatic checkpoints
    logging_steps=1,                 # int
    output_dir="outputs",            # str
    optim="adamw_hf",                # str, paged_adamw_32bit not supported on XPU
    warmup_steps=100,                # int
    bf16=True,                       # bool
)

dpo_trainer = DPOTrainer(
    model,           # Policy model (PeftModel with LoRA)
    ref_model,       # Reference model (frozen)
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()
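The batch and length settings above imply a simple budget that is worth checking before a run. The sketch below works through that arithmetic; the helper name `training_budget` and the single-device default (`num_devices=1`) are assumptions made for illustration.

```python
def training_budget(per_device_batch, grad_accum, max_steps,
                    max_prompt_length, max_length, num_devices=1):
    """Derive effective batch size and token budget from the config values."""
    # Preference pairs consumed per optimizer update.
    pairs_per_step = per_device_batch * grad_accum * num_devices
    return {
        "pairs_per_optimizer_step": pairs_per_step,
        "pairs_seen_in_total": pairs_per_step * max_steps,
        # Tokens left for a response when the prompt uses its full allowance.
        "tokens_left_for_response": max_length - max_prompt_length,
    }


budget = training_budget(
    per_device_batch=4,
    grad_accum=4,
    max_steps=200,
    max_prompt_length=1024,
    max_length=1536,
)
print(budget)
```

With the values shown in the signature, this gives 16 pairs per optimizer step, 3200 pairs over 200 steps, and 512 tokens for the response after a maximum-length prompt.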

Import

from trl import DPOConfig, DPOTrainer

I/O Contract

Inputs

Name	Type	Required	Description
model	PeftModel	Yes	Policy model with LoRA adapters
ref_model	PreTrainedModel	Yes	Frozen reference model
training_args	DPOConfig	Yes	DPO training configuration, passed to DPOTrainer as args
train_dataset	Dataset	Yes	Formatted preference dataset (prompt, chosen, rejected)
tokenizer	PreTrainedTokenizerBase	Yes	Model tokenizer (typically loaded with AutoTokenizer.from_pretrained)
beta	float	No	KL penalty coefficient, set on DPOConfig (default 0.1)

Outputs

Name	Type	Description
trained model	PeftModel	Policy model with trained LoRA weights aligned to preferences
artifacts	Files	LoRA adapter weights and tokenizer files written to output_dir by the explicit save_pretrained calls

Usage Examples

from trl import DPOConfig, DPOTrainer

# Configure DPO training
training_args = DPOConfig(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    beta=0.1,
    max_prompt_length=1024,
    max_length=1536,
    max_steps=200,
    output_dir="./dpo-output",
    optim="adamw_hf",
    bf16=True,
    warmup_steps=100,
)

# Create DPO trainer
dpo_trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

# Train
dpo_trainer.train()

# Save
dpo_trainer.model.save_pretrained("./dpo-output")
tokenizer.save_pretrained("./dpo-output")
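The configuration above combines lr_scheduler_type="cosine" with warmup_steps=100 and max_steps=200. The sketch below approximates the resulting learning-rate curve, assuming a linear warmup followed by a half-cosine decay to zero; the function `cosine_lr` is written for illustration and is not the scheduler code used by the trainer.

```python
import math


def cosine_lr(step, base_lr=5e-5, warmup_steps=100, max_steps=200):
    """Approximate learning rate at a given optimizer step (illustrative)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    # Half-cosine decay from base_lr down to 0.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 150, 200):
    print(step, cosine_lr(step))
```

Under this assumption the peak rate of 5e-5 is reached only at step 100, so half of the 200 steps are spent warming up.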

Related Pages

Implements Principle

Principle:Intel_Ipex_llm_DPO_Training

Requires Environment

Environment:Intel_Ipex_llm_XPU_Finetuning_Environment

Uses Heuristic
