Implementation:Intel Ipex llm DPOTrainer Usage

From Leeroopedia


Knowledge Sources
Domains: NLP, RLHF, Training
Last Updated: 2026-02-09 00:00 GMT

Overview

TRL DPOTrainer configured for preference optimization training on Intel XPU with IPEX-LLM.

Description

This Wrapper Doc covers TRL's DPOTrainer and DPOConfig as used for IPEX-LLM DPO training on Intel XPU. DPOConfig extends TrainingArguments with DPO-specific parameters (beta, max_prompt_length, max_length). DPOTrainer computes the DPO loss from the log-probabilities that the policy model and the frozen reference model assign to the chosen and rejected responses. The key Intel XPU adaptations are bf16=True and optim="adamw_hf", because paged_adamw_32bit is not supported on XPU.
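To make the loss between the policy and reference models concrete, here is a minimal sketch of the standard sigmoid DPO objective for a single preference pair. It is not TRL's implementation: the function names are hypothetical, the inputs are assumed to be summed sequence log-probabilities, and TRL's other loss variants are not covered.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Sigmoid DPO loss for one (chosen, rejected) pair (illustrative helper)."""
    # How much more the policy prefers "chosen" over "rejected".
    policy_margin = policy_chosen_logp - policy_rejected_logp
    # The same preference margin under the frozen reference model.
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # beta scales how strongly deviations from the reference are weighted.
    return -log_sigmoid(beta * (policy_margin - ref_margin))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy favours the chosen response more than the reference does: lower loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

When the policy and reference agree, the loss sits at log(2); it falls as the policy widens the chosen-over-rejected margin relative to the reference.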

External Reference

Usage

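Use DPOTrainer after loading the policy and reference models and formatting the dataset into prompt, chosen, and rejected fields. Configure DPOConfig with the Intel XPU-compatible settings described above (bf16=True, optim="adamw_hf").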
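One way to catch XPU-incompatible settings early is to check a plain dictionary of keyword arguments before passing it to DPOConfig. The helper below is a hypothetical sketch, not part of TRL or IPEX-LLM, and it encodes only the two constraints stated on this page.

```python
def xpu_setting_warnings(settings: dict) -> list:
    """Return warnings for settings this page flags on Intel XPU.

    Hypothetical helper: it checks only the two points stated on this
    page and makes no claim about other optimizers or precisions.
    """
    warnings = []
    if settings.get("optim") == "paged_adamw_32bit":
        warnings.append(
            "optim='paged_adamw_32bit' is not supported on XPU; "
            "this page uses optim='adamw_hf'"
        )
    if not settings.get("bf16", False):
        warnings.append(
            "bf16 is not enabled; this page's XPU configuration sets bf16=True"
        )
    return warnings


page_settings = {"optim": "adamw_hf", "bf16": True, "beta": 0.1}
cuda_style_settings = {"optim": "paged_adamw_32bit", "bf16": False}

ok = xpu_setting_warnings(page_settings)            # no warnings expected
problems = xpu_setting_warnings(cuda_style_settings)  # two warnings expected
```

A dictionary that passes the check could then be unpacked as DPOConfig(**page_settings, output_dir=...).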

Code Reference

Source Location

  • Repository: IPEX-LLM
  • File: python/llm/example/GPU/LLM-Finetuning/DPO/dpo_finetuning.py
  • Lines: 146-178

Signature

from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    per_device_train_batch_size=4,   # int
    gradient_accumulation_steps=4,   # int
    gradient_checkpointing=False,    # bool
    learning_rate=5e-5,              # float
    lr_scheduler_type="cosine",      # str
    beta=0.1,                        # float; DPO-specific: KL penalty coefficient
    max_prompt_length=1024,          # int; DPO-specific: max prompt tokens
    max_length=1536,                 # int; DPO-specific: max total tokens
    max_steps=200,                   # int
    save_strategy="no",              # str
    logging_steps=1,                 # int
    output_dir="outputs",            # str
    optim="adamw_hf",                # str; paged_adamw_32bit not supported on XPU
    warmup_steps=100,                # int
    bf16=True,                       # bool
)

dpo_trainer = DPOTrainer(
    model,           # Policy model (PeftModel with LoRA)
    ref_model,       # Reference model (frozen)
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()
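
The numbers in this configuration can be worked through by hand. The sketch below computes the effective batch size and an approximate learning-rate curve for lr_scheduler_type="cosine" with warmup_steps=100 and max_steps=200. It assumes a single device and the usual linear-warmup plus half-cosine shape. The function names are hypothetical, and the scheduler actually used by the trainer may differ in detail.

```python
import math


def effective_batch_size(per_device: int, accumulation: int, num_devices: int = 1) -> int:
    """Preference pairs consumed per optimizer step (num_devices is an assumption)."""
    return per_device * accumulation * num_devices


def cosine_lr(step: int, base_lr: float = 5e-5,
              warmup_steps: int = 100, max_steps: int = 200) -> float:
    """Approximate learning rate at a step: linear warmup, then half-cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


pairs_per_step = effective_batch_size(4, 4)   # 4 * 4 = 16 pairs per step
total_pairs = pairs_per_step * 200            # 3200 pairs over max_steps=200
peak_lr = cosine_lr(100)                      # warmup ends at the base rate
final_lr = cosine_lr(200)                     # decays to zero at max_steps
```

With these values, half of the 200 steps are spent warming up, so the learning rate sits at its peak of 5e-5 only at step 100.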

Import

from trl import DPOConfig, DPOTrainer

I/O Contract

Inputs

Name | Type | Required | Description
model | PeftModel | Yes | Policy model with LoRA adapters
ref_model | PreTrainedModel | Yes | Frozen reference model
args | DPOConfig | Yes | DPO training configuration (training_args in the examples)
train_dataset | Dataset | Yes | Formatted preference dataset with prompt, chosen, and rejected fields
tokenizer | PreTrainedTokenizer (loaded via AutoTokenizer) | Yes | Model tokenizer
beta | float | No | KL penalty coefficient, set on DPOConfig (default 0.1)
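
The prompt, chosen, and rejected fields expected for train_dataset can be produced by a small mapping function. The sketch below is illustrative only: the input field names (question, system) and the plain-text prompt layout are assumptions, and a real run would normally use the chat template of the model being trained.

```python
def to_preference_record(question: str, chosen: str, rejected: str,
                         system: str = "") -> dict:
    """Build one prompt/chosen/rejected record (hypothetical prompt layout)."""
    prefix = f"{system}\n\n" if system else ""
    return {
        "prompt": f"{prefix}{question}\n",
        "chosen": chosen,
        "rejected": rejected,
    }


def validate_preference_record(record: dict) -> None:
    """Raise ValueError if a record is unusable for preference training."""
    missing = [key for key in ("prompt", "chosen", "rejected") if not record.get(key)]
    if missing:
        raise ValueError(f"missing or empty fields: {missing}")
    if record["chosen"] == record["rejected"]:
        raise ValueError("chosen and rejected responses are identical")


record = to_preference_record(
    question="What is the capital of France?",
    chosen="The capital of France is Paris.",
    rejected="France is a country in Europe.",
    system="You are a helpful assistant.",
)
validate_preference_record(record)  # passes silently for a well-formed record
```

A function like this could be applied row by row to a raw dataset before it is passed as train_dataset.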

Outputs

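Name | Type | Description
trained model | PeftModel | Policy model with trained LoRA weights aligned to preferences
artifacts | Files | Model and tokenizer files written by explicit save_pretrained calls (the signature sets save_strategy="no", so the trainer writes no checkpoints itself)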

Usage Examples

from trl import DPOConfig, DPOTrainer

# Configure DPO training
training_args = DPOConfig(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    beta=0.1,
    max_prompt_length=1024,
    max_length=1536,
    max_steps=200,
    output_dir="./dpo-output",
    optim="adamw_hf",
    bf16=True,
    warmup_steps=100,
)

# Create DPO trainer
dpo_trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

# Train
dpo_trainer.train()

# Save
dpo_trainer.model.save_pretrained("./dpo-output")
tokenizer.save_pretrained("./dpo-output")

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
