Implementation: Hugging Face TRL DPO Reference Model Pattern
| Knowledge Sources | |
|---|---|
| Domains | NLP, RLHF |
| Last Updated | 2026-02-06 17:00 GMT |
Overview
A concrete pattern, provided by the TRL library, for setting up the DPO reference model: either as an explicit separate model or implicitly via PEFT adapter disabling.
Description
The reference model setup in TRL's DPO pipeline follows a conditional pattern based on whether PEFT (LoRA) is being used:
- Without PEFT: A second copy of the pretrained model is loaded with identical configuration (same model_kwargs) and passed as ref_model to the DPOTrainer. Inside the trainer, this model is prepared for distributed training via DeepSpeed, FSDP, or Accelerate's prepare_model in evaluation mode.
- With PEFT: The reference model is set to None. The DPOTrainer detects this and uses the null_ref_context context manager, which calls model.disable_adapter() on the PEFT model to temporarily expose the frozen base model. This is done via self.accelerator.unwrap_model(self.model).disable_adapter().
- With precomputed log probs: If precompute_ref_log_probs=True, the reference model is used only during the precomputation phase and then discarded.
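The implicit (PEFT) case above can be illustrated with a toy stand-in. The PeftLikeModel class below is hypothetical, not the real PEFT API; it only shows the core idea that disabling the adapter makes the same object behave as the frozen base model, so no second copy is held in memory.

```python
from contextlib import contextmanager

class PeftLikeModel:
    """Toy stand-in for a PEFT-wrapped model: base output plus an adapter delta."""
    def __init__(self, base_output, adapter_delta):
        self.base_output = base_output
        self.adapter_delta = adapter_delta
        self._adapter_enabled = True

    def forward(self):
        # Policy behavior when the adapter is on; reference behavior when off
        delta = self.adapter_delta if self._adapter_enabled else 0.0
        return self.base_output + delta

    @contextmanager
    def disable_adapter(self):
        # Temporarily expose the frozen base model, as TRL's null_ref_context does
        self._adapter_enabled = False
        try:
            yield self
        finally:
            self._adapter_enabled = True

model = PeftLikeModel(base_output=1.0, adapter_delta=0.5)
policy_out = model.forward()      # base + adapter
with model.disable_adapter():
    ref_out = model.forward()     # base only: the implicit reference
restored_out = model.forward()    # adapter is re-enabled on exit
```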
The DPOTrainer's __init__ method (lines 386-392) handles the resolution: if a ref_model is explicitly provided, it is used directly; if the model is a PEFT model or precompute is enabled, ref_model is set to None; otherwise, a deep copy of the model is created via create_reference_model.
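This three-way resolution can be sketched as a standalone function. The name resolve_ref_model and the tuple stand-in for the deep copy are illustrative only; the real logic lives inside DPOTrainer.__init__.

```python
def resolve_ref_model(ref_model, is_peft_model, precompute_ref_log_probs, model):
    """Sketch of the DPOTrainer reference-model resolution (illustrative only)."""
    if ref_model is not None:
        # An explicitly provided reference model is used directly
        return ref_model
    if is_peft_model or precompute_ref_log_probs:
        # The adapters-off model (or cached log probs) will serve as the reference
        return None
    # Otherwise a frozen deep copy is created (stand-in for create_reference_model)
    return ("deep_copy_of", model)
```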
Usage
Use the explicit reference model pattern (without PEFT) when:
- Full fine-tuning all model parameters
- You have sufficient GPU memory for two model copies
- You need a separate reference that will not be affected by training
Use the implicit reference pattern (with PEFT) when:
- Using LoRA or other PEFT methods
- Memory is constrained and you cannot hold two full model copies
- The base model serves as an adequate reference distribution
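The two checklists above reduce to a small decision helper. The function below is hypothetical, purely for illustration of the trade-off; the precompute branch covers the memory-constrained full fine-tuning case from the Description.

```python
def choose_reference_pattern(use_peft: bool, fits_two_copies: bool) -> str:
    """Map the usage criteria above to a reference-model pattern (illustrative)."""
    if use_peft:
        # LoRA/PEFT: the base model with adapters disabled is the reference
        return "implicit (ref_model=None)"
    if fits_two_copies:
        # Full fine-tuning with enough GPU memory: load a frozen second copy
        return "explicit (separate ref_model)"
    # Full fine-tuning but memory-constrained: precompute ref logps, then discard
    return "precompute_ref_log_probs=True"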
Code Reference
Source Location
- Repository: TRL
- File (script): trl/scripts/dpo.py (lines 108-114)
- File (trainer resolution): trl/trainer/dpo_trainer.py (lines 386-392)
- File (null_ref_context): trl/trainer/dpo_trainer.py (lines 933-945)
Signature
```python
# Reference model setup pattern from trl/scripts/dpo.py
peft_config = get_peft_config(model_args)
if peft_config is None:
    ref_model = AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        trust_remote_code=model_args.trust_remote_code,
        **model_kwargs,
    )
else:
    ref_model = None  # DPOTrainer disables adapters for reference behavior

# DPOTrainer reference model resolution (dpo_trainer.py lines 386-392)
if ref_model:
    self.ref_model = ref_model
elif self.is_peft_model or args.precompute_ref_log_probs:
    # The model with adapters turned off will be used as the reference model
    self.ref_model = None
else:
    self.ref_model = create_reference_model(model)

# Context manager for implicit reference (dpo_trainer.py lines 933-945)
@contextmanager
def null_ref_context(self):
    """Context manager for handling null reference model (peft adapter manipulation)."""
    with (
        self.accelerator.unwrap_model(self.model).disable_adapter()
        if self.is_peft_model and not self.ref_adapter_name
        else nullcontext()
    ):
        if self.ref_adapter_name:
            self.model.set_adapter(self.ref_adapter_name)
        yield
        if self.ref_adapter_name:
            self.model.set_adapter(self.model_adapter_name or "default")
```
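The create_reference_model fallback in the resolution snippet deep-copies the policy and freezes it. The sketch below mimics that behavior on toy objects (ToyModel and ToyParam are hypothetical stand-ins, not the Transformers API): copy, freeze all parameters, switch to evaluation mode.

```python
import copy

class ToyParam:
    """Minimal stand-in for a trainable parameter."""
    def __init__(self):
        self.requires_grad = True

class ToyModel:
    """Minimal stand-in for a PreTrainedModel (illustrative only)."""
    def __init__(self):
        self.params = [ToyParam(), ToyParam()]
        self.training = True

def create_reference_model_sketch(model):
    # Deep-copy so training never mutates the reference weights
    ref = copy.deepcopy(model)
    for p in ref.params:
        p.requires_grad = False  # freeze: the reference is never updated
    ref.training = False         # evaluation mode
    return ref

policy = ToyModel()
reference = create_reference_model_sketch(policy)
```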
Import
```python
from transformers import AutoModelForCausalLM
from trl import get_peft_config, ModelConfig
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_args | ModelConfig | Yes | Model configuration including model_name_or_path, use_peft, lora_r, etc. |
| peft_config | PeftConfig or None | Yes | PEFT configuration; None triggers explicit reference model loading |
| model_kwargs | dict | Yes | Keyword arguments (revision, attn_implementation, dtype, quantization) for model loading |
| trust_remote_code | bool | No | Whether to trust custom model code from the Hub |
Outputs
| Name | Type | Description |
|---|---|---|
| ref_model | PreTrainedModel or None | The reference model (explicit copy) or None (implicit via PEFT adapter disabling) |
Usage Examples
```python
# Example 1: Explicit reference model (full fine-tuning, no PEFT)
from transformers import AutoModelForCausalLM
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
ref_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,              # a DPOConfig instance
    train_dataset=dataset["train"],  # a preference dataset
)
```
```python
# Example 2: Implicit reference model (with PEFT/LoRA)
from peft import LoraConfig
from trl import DPOTrainer

peft_config = LoraConfig(r=32, lora_alpha=16, target_modules="all-linear")
trainer = DPOTrainer(
    model=model,
    ref_model=None,  # base model with adapters disabled serves as reference
    args=training_args,
    train_dataset=dataset["train"],
    peft_config=peft_config,
)
```
```python
# Example 3: Precomputed reference log probabilities
from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    output_dir="./dpo-output",
    precompute_ref_log_probs=True,  # compute ref logps once before training
)
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # used only during precomputation, then discarded
    args=training_args,
    train_dataset=dataset["train"],
)
```