Principle: Hugging Face TRL DPO Reference Model Setup
| Knowledge Sources | |
|---|---|
| Domains | NLP, RLHF |
| Last Updated | 2026-02-06 17:00 GMT |
Overview
The reference model in Direct Preference Optimization serves as a fixed anchor that prevents the policy model from diverging too far from the original distribution during preference learning.
Description
The DPO loss function computes a ratio between the policy model's and the reference model's log probabilities for both chosen and rejected responses. The reference model defines the baseline distribution from which the policy is allowed to deviate, with the degree of deviation controlled by the beta parameter.
There are four strategies for providing reference model behavior in TRL:
1. Explicit reference model (full fine-tuning): When training the full model (no PEFT adapters), a separate copy of the pretrained model is loaded as the reference. This copy is frozen (no gradient computation) and used in evaluation mode to compute reference log probabilities. This approach doubles the memory requirement since two full copies of the model must reside in memory.
2. Implicit reference via PEFT adapters: When using parameter-efficient fine-tuning (PEFT/LoRA), the base model with adapters disabled serves as the implicit reference model. The DPOTrainer achieves this by using a context manager (null_ref_context) that temporarily disables the LoRA adapters, exposing the frozen base model underneath. This eliminates the need for a separate reference model, roughly halving memory requirements.
3. Precomputed reference log probabilities: As an alternative to keeping a reference model in memory during training, reference log probabilities can be precomputed over the entire dataset before training begins. This trades a one-time computation cost for the benefit of not needing any reference model during the training loop, further reducing memory usage.
4. Synchronized reference (TR-DPO): The TR-DPO method periodically updates the reference model as a moving average of the policy, preventing the reference from becoming stale during long training runs.
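The first two strategies can be sketched in TRL as follows. This assumes the DPOTrainer/DPOConfig API; the checkpoint and dataset names are placeholders, and hyperparameter values are illustrative:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer
from peft import LoraConfig

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")  # placeholder dataset

# Strategy 1: explicit reference model (full fine-tuning, ~2x memory).
# A second frozen copy of the pretrained model computes reference log-probs.
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=DPOConfig(output_dir="dpo-full"),
    train_dataset=train_dataset,
    processing_class=tokenizer,
)

# Strategy 2: implicit reference via PEFT. With ref_model=None and a
# peft_config, the base model with adapters disabled is the reference.
peft_model = AutoModelForCausalLM.from_pretrained(model_name)
trainer = DPOTrainer(
    model=peft_model,
    ref_model=None,
    args=DPOConfig(output_dir="dpo-lora"),
    train_dataset=train_dataset,
    processing_class=tokenizer,
    peft_config=LoraConfig(r=16, lora_alpha=32),
)
```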
Usage
Set up a reference model when:
- Running full fine-tuning DPO (no PEFT): an explicit reference model is required
- Using PEFT/LoRA: set ref_model=None to use the implicit reference
- Optimizing memory: enable precompute_ref_log_probs=True to avoid holding a reference model during training
- Using TR-DPO: enable sync_ref_model=True to periodically update the reference
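The memory and TR-DPO options are set on DPOConfig. A hedged sketch, with values chosen only for illustration:

```python
from trl import DPOConfig

# Memory-optimized: compute reference log-probs once over the dataset,
# so no reference model is needed during the training loop.
precompute_config = DPOConfig(
    output_dir="dpo-precomputed",
    precompute_ref_log_probs=True,
)

# TR-DPO: periodically pull the reference toward the current policy.
trdpo_config = DPOConfig(
    output_dir="dpo-trdpo",
    sync_ref_model=True,
    ref_model_mixup_alpha=0.6,  # weight on the policy in the soft update
    ref_model_sync_steps=512,   # how often the reference is synchronized
)
```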
Theoretical Basis
The reference model pi_ref appears in the DPO loss through the log-probability ratios:
log(pi_theta(y|x) / pi_ref(y|x)) = log pi_theta(y|x) - log pi_ref(y|x)
The DPO objective can be understood as learning an implicit reward function:
r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
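The interaction of the reference log-probabilities and beta can be made concrete with a small numeric sketch over scalar sequence log-probabilities (the helper name is illustrative, not a TRL function):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed log-probabilities."""
    # Implicit rewards: beta-scaled log-ratios against the reference
    r_chosen = beta * (pi_chosen - ref_chosen)
    r_rejected = beta * (pi_rejected - ref_rejected)
    # Negative log-sigmoid of the reward margin
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At initialization the policy equals the reference, so both implicit
# rewards are zero and the loss is log(2)
print(round(dpo_loss(-20.0, -25.0, -20.0, -25.0), 4))  # → 0.6931
```

As the policy raises the chosen response's log-probability relative to the reference (and/or lowers the rejected one), the margin grows and the loss falls below log(2).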
The reference model ensures that the learned policy does not collapse to only generating the chosen responses while ignoring general language modeling quality. Without the KL constraint (i.e., without the reference), the policy could degenerate by assigning all probability mass to a narrow set of preferred outputs.
When using PEFT adapters, the reference behavior is obtained by disabling adapters:
pi_ref(y|x) = pi_base(y|x) (base model without adapters)
pi_theta(y|x) = pi_base+adapter(y|x) (base model with active adapters)
This works because the adapter parameters are the only trainable parameters. The base model weights remain frozen, so the base model naturally serves as the pre-training reference distribution.
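The adapter-as-reference identity can be illustrated with a toy linear layer in plain Python (no PEFT; all names here are illustrative, and toggling the flag plays the role of the adapter-disabling context manager):

```python
class ToyAdapterLinear:
    """A frozen base weight plus a toggleable adapter delta."""

    def __init__(self, base_w, adapter_delta):
        self.base_w = base_w                # frozen pretrained weight
        self.adapter_delta = adapter_delta  # trainable adapter contribution
        self.adapter_enabled = True

    def forward(self, x):
        # With the adapter enabled we get the policy; disabled, the base model
        w = self.base_w + (self.adapter_delta if self.adapter_enabled else 0.0)
        return w * x

layer = ToyAdapterLinear(base_w=2.0, adapter_delta=0.5)
policy_out = layer.forward(3.0)   # policy: (base + adapter) * x
layer.adapter_enabled = False     # analogous to disabling LoRA adapters
ref_out = layer.forward(3.0)      # reference: base * x, no extra model needed
```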
For TR-DPO, the reference model is updated periodically:
pi_ref = alpha * pi_theta + (1 - alpha) * pi_ref_prev
where alpha is the ref_model_mixup_alpha parameter and the update happens every ref_model_sync_steps steps.
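The soft update can be sketched elementwise over flattened weight vectors (an illustrative helper, not TRL's internal implementation):

```python
def soft_update(policy_w, ref_w, alpha):
    """Move each reference weight toward the policy: alpha*policy + (1-alpha)*ref."""
    return [alpha * p + (1.0 - alpha) * r for p, r in zip(policy_w, ref_w)]

# alpha=0.6 pulls the reference 60% of the way toward the policy
print(soft_update([1.0, 2.0], [0.0, 0.0], alpha=0.6))  # → [0.6, 1.2]
```

With alpha=1 the reference is replaced by the policy outright; with alpha=0 it never moves, recovering the fixed-reference behavior.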