Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Implementation:Huggingface Trl DPOTrainer Init Train

From Leeroopedia


Knowledge Sources
Domains NLP, RLHF
Last Updated 2026-02-06 17:00 GMT

Overview

Concrete tool for initializing the DPO trainer and executing the preference optimization training loop, provided by the TRL library.

Description

DPOTrainer is the main trainer class for Direct Preference Optimization, inheriting from BaseTrainer (which extends Hugging Face's Trainer). Its __init__ method orchestrates:

  1. Argument resolution: If no DPOConfig is provided, creates a default one.
  2. Model preparation: Handles string-based model loading via create_model_from_path, validates that model and ref_model are not the same object, and resolves processing class (tokenizer or processor).
  3. PEFT wrapping: If a peft_config is provided, wraps the model with get_peft_model, handling quantized model preparation and bf16 casting.
  4. Reference model resolution: Uses the explicit ref_model, or sets to None for PEFT/precompute scenarios, or creates a deep copy via create_reference_model.
  5. Dropout disabling: Disables dropout in both policy and reference models for stable DPO training.
  6. Data collator setup: Defaults to DataCollatorForPreference if none is provided.
  7. Dataset preparation: Applies prompt extraction, chat template, and tokenization to train and eval datasets.
  8. Distributed training preparation: Prepares the reference model for DeepSpeed, FSDP, or standard Accelerate.
  9. TR-DPO callback: Registers SyncRefModelCallback if reference model synchronization is enabled.

The compute_loss method is the training loop entry point called by the parent Trainer:

  1. Calls get_batch_loss_metrics which runs the concatenated forward pass, computes reference log probs, and calculates the DPO loss for all configured loss types
  2. Returns the scalar loss (and optionally metrics) for gradient computation

The train method is inherited from the Transformers Trainer and handles the full training loop with gradient accumulation, mixed precision, distributed training, checkpointing, and logging.

Usage

Use DPOTrainer when:

  • Training a model with DPO from a preference dataset
  • You need the full Hugging Face Trainer ecosystem (logging, checkpointing, distributed training)
  • Combining DPO with PEFT/LoRA for parameter-efficient alignment
  • Running multi-loss (MPO) training with combined objectives

Code Reference

Source Location

  • Repository: TRL
  • File: trl/trainer/dpo_trainer.py
  • __init__: lines 271-569
  • compute_loss: lines 1830-1851
  • get_batch_loss_metrics: lines 1742-1828
  • dpo_loss: lines 1039-1260
  • concatenated_forward: lines 1494-1740

Signature

class DPOTrainer(BaseTrainer):
    _tag_names = ["trl", "dpo"]
    _name = "DPO"

    def __init__(
        self,
        model: str | nn.Module | PreTrainedModel,
        ref_model: PreTrainedModel | nn.Module | str | None = None,
        args: DPOConfig | None = None,
        data_collator: DataCollator | None = None,
        train_dataset: Dataset | IterableDataset | None = None,
        eval_dataset: Dataset | IterableDataset | dict[str, Dataset | IterableDataset] | None = None,
        processing_class: PreTrainedTokenizerBase | BaseImageProcessor | FeatureExtractionMixin | ProcessorMixin | None = None,
        compute_metrics: Callable[[EvalLoopOutput], dict] | None = None,
        callbacks: list[TrainerCallback] | None = None,
        optimizers: tuple[torch.optim.Optimizer | None, torch.optim.lr_scheduler.LambdaLR | None] = (None, None),
        optimizer_cls_and_kwargs: tuple[type[torch.optim.Optimizer], dict[str, Any]] | None = None,
        preprocess_logits_for_metrics: Callable[[torch.Tensor, torch.Tensor], torch.Tensor] | None = None,
        peft_config: PeftConfig | None = None,
    ):

    def compute_loss(
        self,
        model: PreTrainedModel | nn.Module,
        inputs: dict[str, torch.Tensor | Any],
        return_outputs=False,
        num_items_in_batch=None,
    ) -> torch.Tensor | tuple[torch.Tensor, dict[str, float]]:

    def dpo_loss(
        self,
        chosen_logps: torch.FloatTensor,
        rejected_logps: torch.FloatTensor,
        ref_chosen_logps: torch.FloatTensor,
        ref_rejected_logps: torch.FloatTensor,
        loss_type: str = "sigmoid",
        model_output: dict[str, torch.FloatTensor] = None,
    ) -> tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]:

    def train(self) -> TrainOutput:  # inherited from Trainer

Import

from trl import DPOTrainer, DPOConfig

I/O Contract

Inputs

Name Type Required Description
model str or PreTrainedModel Yes Policy model: a model ID string or a pretrained model instance
ref_model PreTrainedModel or None No Reference model; None when using PEFT or precomputed log probs
args DPOConfig No Training configuration with DPO-specific hyperparameters
train_dataset Dataset or IterableDataset Yes Preference dataset with prompt, chosen, rejected columns
eval_dataset Dataset or dict or None No Evaluation dataset(s) with the same format as train_dataset
processing_class PreTrainedTokenizerBase or ProcessorMixin or None No Tokenizer or processor; auto-loaded from model if None
peft_config PeftConfig or None No PEFT configuration for parameter-efficient training
callbacks list[TrainerCallback] or None No Custom training callbacks
optimizers tuple No (default: (None, None)) Custom optimizer and scheduler; defaults to AdamW with linear warmup

Outputs

Name Type Description
DPOTrainer instance DPOTrainer Initialized trainer ready for .train() and .evaluate()
TrainOutput TrainOutput Training results including global_step, training_loss, and metrics (returned by .train())
Training metrics dict[str, float] Per-step metrics: loss, rewards/chosen, rewards/rejected, rewards/margins, rewards/accuracies, logps/chosen, logps/rejected, logits/chosen, logits/rejected

Usage Examples

# Example 1: Basic DPO training (full fine-tuning)
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
ref_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized")

training_args = DPOConfig(
    output_dir="Qwen2-0.5B-DPO",
    beta=0.1,
    loss_type=["sigmoid"],
    learning_rate=5e-7,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)

trainer.train()
# Example 2: DPO with LoRA (parameter-efficient)
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    output_dir="Qwen2-0.5B-DPO-LoRA",
    beta=0.1,
    learning_rate=5e-6,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # implicit reference via adapter disabling
    args=training_args,
    train_dataset=dataset["train"],
    peft_config=peft_config,
)

trainer.train()
# Example 3: Multi-loss (MPO-style) training
training_args = DPOConfig(
    output_dir="Qwen2-0.5B-MPO",
    beta=0.1,
    loss_type=["sigmoid", "sft"],
    loss_weights=[0.8, 1.0],
    learning_rate=5e-7,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset["train"],
)

trainer.train()

Related Pages

Implements Principle

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment