Implementation:Intel Ipex llm DPOTrainer Usage
| Knowledge Sources | |
|---|---|
| Domains | NLP, RLHF, Training |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
TRL DPOTrainer configured for preference optimization training on Intel XPU with IPEX-LLM.
Description
This page documents how TRL's DPOTrainer and DPOConfig are used for IPEX-LLM DPO training on Intel XPU. DPOConfig extends TrainingArguments with DPO-specific parameters (beta, max_prompt_length, max_length). DPOTrainer computes the DPO loss from the log-probabilities that the policy and reference models assign to the chosen and rejected responses. The key Intel XPU adaptations are bf16=True and optim="adamw_hf", because paged_adamw_32bit is not supported on XPU.
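To make the loss computation concrete, here is a minimal sketch of the standard sigmoid form of the DPO objective for a single preference pair. It is illustrative only and is not TRL's implementation; the function name `dpo_loss` and its arguments are hypothetical, and each argument is assumed to be the summed log-probability of a full response.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Sigmoid DPO loss for one (chosen, rejected) pair (hypothetical helper)."""
    # How much more the policy prefers "chosen" over "rejected" ...
    policy_margin = policy_chosen_logp - policy_rejected_logp
    # ... compared with how much the frozen reference model does.
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) written in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# When the policy matches the reference, the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# When the policy favors the chosen response more than the reference does, the loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

A larger beta makes the same margin difference count for more, so the loss penalizes drift from the reference model more sharply.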
External Reference
Usage
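Use DPOTrainer after loading the policy and reference models and formatting the dataset into prompt, chosen, and rejected fields. Configure DPOConfig with the Intel XPU-compatible settings (bf16=True, optim="adamw_hf").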
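One way to keep the XPU-specific settings in a single place is a small helper that returns keyword overrides for DPOConfig. This is a sketch under assumptions: the helper name `xpu_dpo_overrides` is hypothetical, and the non-XPU branch simply assumes paged_adamw_32bit would be the choice elsewhere, which this page does not confirm.

```python
def xpu_dpo_overrides(device_type: str) -> dict:
    """Return DPOConfig keyword overrides for a device type (hypothetical helper)."""
    if device_type == "xpu":
        # paged_adamw_32bit is not supported on XPU, so fall back to adamw_hf.
        return {"bf16": True, "optim": "adamw_hf"}
    # Assumption: other devices can use the paged optimizer.
    return {"bf16": True, "optim": "paged_adamw_32bit"}


overrides = xpu_dpo_overrides("xpu")
# Intended use (requires trl to be installed):
#   training_args = DPOConfig(output_dir="outputs", beta=0.1, **overrides)
```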
Code Reference
Source Location
- Repository: IPEX-LLM
- File: python/llm/example/GPU/LLM-Finetuning/DPO/dpo_finetuning.py
- Lines: 146-178
Signature
from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    per_device_train_batch_size: int = 4,
    gradient_accumulation_steps: int = 4,
    gradient_checkpointing: bool = False,
    learning_rate: float = 5e-5,
    lr_scheduler_type: str = "cosine",
    beta: float = 0.1,              # DPO-specific: KL penalty coefficient
    max_prompt_length: int = 1024,  # DPO-specific: max prompt tokens
    max_length: int = 1536,         # DPO-specific: max total tokens
    max_steps: int = 200,
    save_strategy: str = "no",
    logging_steps: int = 1,
    output_dir: str = "outputs",
    optim: str = "adamw_hf",        # paged_adamw_32bit not supported on XPU
    warmup_steps: int = 100,
    bf16: bool = True,
)

dpo_trainer = DPOTrainer(
    model,       # Policy model (PeftModel with LoRA)
    ref_model,   # Reference model (frozen)
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

dpo_trainer.train()
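The values in the signature imply a specific training budget. The sketch below works out that arithmetic; the helper `training_budget` is hypothetical, and it assumes a single device and that max_length covers prompt plus response tokens.

```python
def training_budget(per_device_batch: int, grad_accum: int, max_steps: int,
                    warmup_steps: int, max_length: int, max_prompt_length: int,
                    num_devices: int = 1) -> dict:
    """Derive batch and schedule figures from the config values (hypothetical helper)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    return {
        # Preference pairs consumed per optimizer step.
        "effective_batch_size": effective_batch,
        # Total preference pairs consumed over the whole run.
        "pairs_seen": effective_batch * max_steps,
        # Share of the run spent in learning-rate warmup.
        "warmup_fraction": warmup_steps / max_steps,
        # Tokens left for the response when the prompt is at its maximum length.
        "min_response_tokens": max_length - max_prompt_length,
    }


budget = training_budget(4, 4, 200, 100, 1536, 1024)
# budget == {"effective_batch_size": 16, "pairs_seen": 3200,
#            "warmup_fraction": 0.5, "min_response_tokens": 512}
```

With these values, warmup covers half of the 200 steps, so the cosine decay only applies to the second half of the run.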
Import
from trl import DPOConfig, DPOTrainer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | PeftModel | Yes | Policy model with LoRA adapters |
| ref_model | PreTrainedModel | Yes | Frozen reference model |
| training_args | DPOConfig | Yes | DPO training configuration |
| train_dataset | Dataset | Yes | Formatted preference dataset (prompt, chosen, rejected) |
| tokenizer | PreTrainedTokenizerBase | Yes | Model tokenizer (as returned by AutoTokenizer.from_pretrained) |
| beta | float | No | KL penalty coefficient (default 0.1) |
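Because the trainer expects every record to carry prompt, chosen, and rejected text, a quick check before training can catch formatting mistakes early. The following is a minimal sketch, assuming the records are available as a list of dictionaries; `check_preference_rows` is a hypothetical helper and not part of TRL.

```python
REQUIRED_COLUMNS = ("prompt", "chosen", "rejected")


def check_preference_rows(rows: list) -> int:
    """Validate preference records and return how many were checked (hypothetical helper)."""
    for index, row in enumerate(rows):
        for column in REQUIRED_COLUMNS:
            value = row.get(column)
            # Each required field must be a non-empty string.
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"row {index}: column {column!r} is missing or empty")
    return len(rows)


sample_rows = [
    {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"},
]
checked = check_preference_rows(sample_rows)
```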
Outputs
| Name | Type | Description |
|---|---|---|
| trained model | PeftModel | Policy model with trained LoRA weights aligned to preferences |
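| artifacts | Files | LoRA adapter and tokenizer files written to output_dir by explicit save_pretrained calls (save_strategy="no" disables trainer checkpoints) |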
Usage Examples
from trl import DPOConfig, DPOTrainer

# model, ref_model, dataset, and tokenizer come from the earlier loading and formatting steps.

# Configure DPO training
training_args = DPOConfig(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    beta=0.1,
    max_prompt_length=1024,
    max_length=1536,
    max_steps=200,
    output_dir="./dpo-output",
    optim="adamw_hf",  # paged_adamw_32bit is not supported on XPU
    bf16=True,
    warmup_steps=100,
)

# Create DPO trainer
dpo_trainer = DPOTrainer(
    model,      # policy model (PeftModel with LoRA)
    ref_model,  # frozen reference model
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

# Train
dpo_trainer.train()

# Save the trained LoRA adapter and the tokenizer
dpo_trainer.model.save_pretrained("./dpo-output")
tokenizer.save_pretrained("./dpo-output")