Principle: Lucidrains x-transformers Aligned Model Evaluation
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | Paper (Direct Preference Optimization — Rafailov et al.), Repo (x-transformers) |
| Domains | Deep_Learning, NLP, Alignment, Evaluation |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Evaluation and inference procedure for extracting and using a DPO-aligned language model for text generation.
Description
After DPO training, the aligned policy model is extracted from `dpo.policy_model` and can be used for inference. Because DPO trains the raw `TransformerWrapper` (not a wrapped version), the policy model must be wrapped with `AutoregressiveWrapper` before it can be used for text generation.
The evaluation and inference procedure involves:
- Model extraction: Access the aligned policy model via `dpo.policy_model`. This is a standard `TransformerWrapper` instance whose weights have been updated by DPO training to favor human-preferred completions.
- Generation wrapping: Wrap the extracted model with `AutoregressiveWrapper` to gain access to the `.generate()` method, which provides autoregressive decoding with multiple sampling strategies (e.g. top-k, top-p, min-p).
- Quality comparison: The aligned model should produce outputs better aligned with human preferences than the base (reference) model. Evaluation typically involves generating from the aligned model and comparing output quality against the reference model using human evaluation or automated metrics.
Usage
Use after DPO training is complete:
- Extract the policy model from the trained DPO instance via `dpo.policy_model`.
- Wrap it with `AutoregressiveWrapper` to enable generation.
- Generate text using the `.generate()` method with desired sampling parameters.
- Optionally, compare outputs against the reference model (`dpo.ref_model`) to verify alignment improvement.
Theoretical Basis
Alignment Verification
The DPO-trained policy model should satisfy the following property for human-preferred completion y_w and unpreferred completion y_l:
π_θ(y_w|x) / π_θ(y_l|x) > π_ref(y_w|x) / π_ref(y_l|x)
That is, the trained policy assigns a higher relative probability to human-preferred completions compared to the reference model. This means the aligned model has shifted its distribution to favor outputs that humans prefer, while the KL constraint (controlled by β) prevents the model from deviating too far from the reference distribution and losing general language modeling quality.
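In log space, the property above reduces to comparing log-ratio margins, which is how it is checked in practice. The following sketch uses toy, purely illustrative log-probabilities (not real model outputs) to make the comparison concrete:

```python
# Toy per-sequence log-probabilities (assumed for illustration) of a preferred
# completion y_w and an unpreferred completion y_l under the trained policy
# and the frozen reference model.
policy_logp = {"y_w": -12.1, "y_l": -15.8}
ref_logp    = {"y_w": -12.9, "y_l": -13.4}

# The ratio inequality
#   pi_theta(y_w|x) / pi_theta(y_l|x) > pi_ref(y_w|x) / pi_ref(y_l|x)
# becomes, after taking logs, a comparison of margins:
policy_margin = policy_logp["y_w"] - policy_logp["y_l"]  # ~3.7
ref_margin    = ref_logp["y_w"] - ref_logp["y_l"]        # ~0.5

# True: the policy prefers y_w more strongly than the reference does
print(policy_margin > ref_margin)
```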
Evaluation Approaches
Common evaluation approaches for DPO-aligned models include:
- Win-rate comparison: Generate completions from both the aligned and reference models for the same prompts, then have human raters (or a judge model) choose which completion is preferred.
- Perplexity monitoring: Track perplexity on held-out text to ensure the aligned model has not degraded in general language modeling quality.
- Preference log-ratio analysis: Compute `log π_θ(y_w|x) - log π_θ(y_l|x)` on a held-out preference set to verify the model has learned to prefer the correct completions.
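The log-ratio analysis can be sketched as a simple accuracy computation over held-out pairs. The log-probabilities below are assumed, illustrative values; in a real evaluation they would be sequence log-probs scored by the aligned policy.

```python
# Each held-out item holds (illustrative) sequence log-probs of the
# human-preferred ("chosen") and unpreferred ("rejected") completion
# under the aligned policy.
held_out = [
    {"chosen": -10.2, "rejected": -11.7},
    {"chosen": -20.5, "rejected": -19.9},  # the policy misranks this pair
    {"chosen": -8.4,  "rejected": -12.0},
]

# log-ratio margin per pair: log pi(y_w|x) - log pi(y_l|x)
margins = [ex["chosen"] - ex["rejected"] for ex in held_out]

# fraction of pairs where the policy ranks the chosen completion higher
accuracy = sum(m > 0 for m in margins) / len(margins)
print(f"preference accuracy: {accuracy:.2f}")  # prints "preference accuracy: 0.67"
```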