Principle: Eric Mitchell Direct Preference Optimization Evaluation and Sampling
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Training, NLP |
| Last Updated | 2026-02-08 02:00 GMT |
Overview
A training monitoring technique that periodically evaluates model performance on held-out data and generates text samples to qualitatively assess training progress.
Description
Evaluation and sampling provides quantitative and qualitative feedback during training:
- Quantitative evaluation: Computing the same loss and metrics used during training (reward accuracies, margins, log probabilities) on held-out evaluation data. For DPO, the key metric is reward accuracy: how often the model assigns a higher implicit reward to the chosen response than to the rejected response.
- Qualitative sampling: Generating text samples from both the policy and reference models given evaluation prompts. This allows human inspection of how the model's generation quality evolves during training.
- Logging: All metrics and samples are logged to Weights & Biases (wandb) for visualization and experiment tracking.
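As a minimal sketch of the logging step, the snippet below shows one common way to shape evaluation metrics before handing them to `wandb.log`: flattening them into a single dict keyed with an `eval/` prefix so they group together in the dashboard. The helper name and metric keys are illustrative, not taken from any particular implementation.

```python
def flatten_metrics(metrics, prefix="eval"):
    """Flatten a metrics dict into 'prefix/key' entries, a common
    wandb convention for grouping related charts."""
    return {f"{prefix}/{key}": value for key, value in metrics.items()}

metrics = {"loss": 0.48, "rewards/accuracy": 0.71, "rewards/margin": 1.9}
logged = flatten_metrics(metrics)
# A training script would then call e.g. wandb.log(logged),
# producing eval/loss, eval/rewards/accuracy, eval/rewards/margin.
```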
The evaluation is interleaved with training at configurable intervals (eval_every parameter, measured in training examples).
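To make the interleaving concrete, here is a small sketch of how an eval trigger measured in training examples (rather than steps) might work. The function and parameter names (`eval_steps`, `eval_every`, `batch_size`) are assumptions for illustration only.

```python
def eval_steps(total_examples, batch_size, eval_every):
    """Return the example counts at which evaluation would run,
    given eval_every measured in training examples."""
    triggers = []
    next_eval = 0
    seen = 0
    while seen < total_examples:
        if seen >= next_eval:        # time to pause training and evaluate
            triggers.append(seen)
            next_eval += eval_every
        seen += batch_size           # one training batch consumed
    return triggers

# With 100 examples, batch size 8, and eval_every=32:
print(eval_steps(100, 8, 32))  # → [0, 32, 64, 96]
```

Because the trigger compares example counts rather than step counts, the schedule stays stable if the batch size changes between runs.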
Usage
Use this principle when you need to monitor training progress. Evaluation frequency should balance information needs against compute cost, as evaluation pauses training.
Theoretical Basis
For DPO, the primary evaluation metric is reward accuracy:

$$\text{accuracy} = \mathbb{E}_{(x, y_w, y_l)}\left[\mathbb{1}\left(\hat{r}_\theta(x, y_w) > \hat{r}_\theta(x, y_l)\right)\right]$$

where $\hat{r}_\theta(x, y) = \beta \log \dfrac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ is the implicit reward. A reward accuracy above 50% indicates the model is learning to prefer chosen over rejected responses.
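The metric can be computed directly from per-sequence log probabilities under the policy and reference models. This is a sketch under an assumed interface; the argument names and `beta` default are illustrative.

```python
def reward_accuracy(policy_chosen_lp, policy_rejected_lp,
                    ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Fraction of preference pairs where the implicit reward of the
    chosen response exceeds that of the rejected response."""
    correct = 0
    for pc, pr, rc, rr in zip(policy_chosen_lp, policy_rejected_lp,
                              ref_chosen_lp, ref_rejected_lp):
        chosen_reward = beta * (pc - rc)    # beta * log(pi/pi_ref) for chosen
        rejected_reward = beta * (pr - rr)  # same for rejected
        correct += chosen_reward > rejected_reward
    return correct / len(policy_chosen_lp)
```

Note that since $\beta > 0$ merely rescales both rewards, accuracy itself does not depend on its value, though the logged reward margins do.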
Pseudo-code:

```python
# Abstract evaluation loop (NOT actual implementation)
for eval_batch in eval_data:
    loss, metrics = compute_metrics(model, eval_batch)
    log_metrics(metrics)

if sample_during_eval:
    for prompt in eval_prompts:
        policy_sample = model.generate(prompt)
        reference_sample = reference.generate(prompt)
        log_samples(policy_sample, reference_sample)
```