Principle: Lucidrains x-transformers Aligned Model Evaluation
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | Paper (Direct Preference Optimization — Rafailov et al.), Repo (x-transformers) |
| Domains | Deep_Learning, NLP, Alignment, Evaluation |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Evaluation and inference procedure for extracting and using a DPO-aligned language model for text generation.
Description
After DPO training, the aligned policy model is extracted from `dpo.policy_model` and can be used for inference. Because DPO trains the raw `TransformerWrapper` (not a wrapped version), the policy model must be wrapped with `AutoregressiveWrapper` before it can be used for text generation.
The evaluation and inference procedure involves:
- Model extraction: Access the aligned policy model via `dpo.policy_model`. This is a standard `TransformerWrapper` instance whose weights have been updated by DPO training to favor human-preferred completions.
- Generation wrapping: Wrap the extracted model with `AutoregressiveWrapper` to gain access to the `.generate()` method, which provides autoregressive decoding with multiple sampling strategies (e.g. top-k, top-p, min-p).
- Quality comparison: The aligned model should produce outputs better aligned with human preferences than the base (reference) model. Evaluation typically involves generating from the aligned model and comparing output quality against the reference model using human evaluation or automated metrics.
Usage
Use after DPO training is complete:
- Extract the policy model from the trained DPO instance via `dpo.policy_model`.
- Wrap it with `AutoregressiveWrapper` to enable generation.
- Generate text using the `.generate()` method with desired sampling parameters.
- Optionally, compare outputs against the reference model (`dpo.ref_model`) to verify alignment improvement.
Theoretical Basis
Alignment Verification
The DPO-trained policy model should satisfy the following property for human-preferred completion y_w and unpreferred completion y_l:
π_θ(y_w|x) / π_θ(y_l|x) > π_ref(y_w|x) / π_ref(y_l|x)
That is, the trained policy assigns a higher relative probability to human-preferred completions compared to the reference model. This means the aligned model has shifted its distribution to favor outputs that humans prefer, while the KL constraint (controlled by β) prevents the model from deviating too far from the reference distribution and losing general language modeling quality.
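In log space, the property above reduces to comparing log-ratio margins, which is how it is checked in practice. The following sketch uses toy, purely illustrative log-probabilities (not real model outputs) to make the comparison concrete:

```python
# Toy per-sequence log-probabilities (assumed for illustration) of a preferred
# completion y_w and an unpreferred completion y_l under the trained policy
# and the frozen reference model.
policy_logp = {"y_w": -12.1, "y_l": -15.8}
ref_logp    = {"y_w": -12.9, "y_l": -13.4}

# The ratio inequality
#   pi_theta(y_w|x) / pi_theta(y_l|x) > pi_ref(y_w|x) / pi_ref(y_l|x)
# becomes, after taking logs, a comparison of margins:
policy_margin = policy_logp["y_w"] - policy_logp["y_l"]  # ~3.7
ref_margin    = ref_logp["y_w"] - ref_logp["y_l"]        # ~0.5

# True: the policy prefers y_w more strongly than the reference does
print(policy_margin > ref_margin)
```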
Evaluation Approaches
Common evaluation approaches for DPO-aligned models include:
- Win-rate comparison: Generate completions from both the aligned and reference models for the same prompts, then have human raters (or a judge model) choose which completion is preferred.
- Perplexity monitoring: Track perplexity on held-out text to ensure the aligned model has not degraded in general language modeling quality.
- Preference log-ratio analysis: Compute `log π_θ(y_w|x) - log π_θ(y_l|x)` on a held-out preference set to verify the model has learned to prefer the correct completions.
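The log-ratio analysis can be sketched as a simple accuracy computation over held-out pairs. The log-probabilities below are assumed, illustrative values; in a real evaluation they would be sequence log-probs scored by the aligned policy.

```python
# Each held-out item holds (illustrative) sequence log-probs of the
# human-preferred ("chosen") and unpreferred ("rejected") completion
# under the aligned policy.
held_out = [
    {"chosen": -10.2, "rejected": -11.7},
    {"chosen": -20.5, "rejected": -19.9},  # the policy misranks this pair
    {"chosen": -8.4,  "rejected": -12.0},
]

# log-ratio margin per pair: log pi(y_w|x) - log pi(y_l|x)
margins = [ex["chosen"] - ex["rejected"] for ex in held_out]

# fraction of pairs where the policy ranks the chosen completion higher
accuracy = sum(m > 0 for m in margins) / len(margins)
print(f"preference accuracy: {accuracy:.2f}")  # prints "preference accuracy: 0.67"
```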