
Principle: Eric Mitchell Direct Preference Optimization Evaluation and Sampling

From Leeroopedia


Knowledge Sources
Domains Evaluation, Training, NLP
Last Updated 2026-02-08 02:00 GMT

Overview

A training monitoring technique that periodically evaluates model performance on held-out data and generates text samples to qualitatively assess training progress.

Description

Evaluation and sampling provides quantitative and qualitative feedback during training:

  • Quantitative evaluation: Computing the same loss and metrics (reward accuracies, margins, log probabilities) on held-out evaluation data. For DPO, the key metric is reward accuracy: how often the model assigns a higher implicit reward to the chosen response than to the rejected response.
  • Qualitative sampling: Generating text samples from both the policy and reference models given evaluation prompts. This allows human inspection of how the model's generation quality evolves during training.
  • Logging: All metrics and samples are logged to Weights & Biases (wandb) for visualization and experiment tracking.

The evaluation is interleaved with training at configurable intervals (eval_every parameter, measured in training examples).
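This interleaving can be sketched as a counter over training examples. A minimal sketch; the function and argument names (`train_with_periodic_eval`, `train_step`, `eval_fn`) are illustrative, not the actual implementation:

```python
def train_with_periodic_eval(train_batches, eval_fn, batch_size, eval_every, train_step):
    """Run eval_fn roughly every `eval_every` training examples (hypothetical helper)."""
    examples_seen = 0
    last_eval = 0
    eval_log = []
    for batch in train_batches:
        train_step(batch)                       # one optimizer step on this batch
        examples_seen += batch_size
        if examples_seen - last_eval >= eval_every:
            eval_log.append((examples_seen, eval_fn()))  # pause training, evaluate
            last_eval = examples_seen
    return eval_log
```

Because `eval_every` is counted in examples rather than steps, the evaluation cadence stays comparable across runs with different batch sizes.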

Usage

Use this principle when you need to monitor training progress. Evaluation frequency should balance information needs against compute cost, as evaluation pauses training.
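The compute cost of a given frequency can be estimated with simple arithmetic. A back-of-envelope sketch under assumed timings; `eval_overhead_fraction` and its inputs are hypothetical names, not part of any real training codebase:

```python
def eval_overhead_fraction(eval_seconds, examples_per_second, eval_every):
    """Fraction of wall-clock time spent evaluating instead of training.

    eval_seconds: time one full evaluation pass takes (assumed measured)
    examples_per_second: training throughput (assumed measured)
    eval_every: evaluation interval, in training examples
    """
    train_seconds = eval_every / examples_per_second  # training time between evals
    return eval_seconds / (eval_seconds + train_seconds)
```

For example, a 60-second evaluation every 19,200 examples at 32 examples/second spends about 9% of wall-clock time on evaluation; doubling `eval_every` roughly halves that overhead.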

Theoretical Basis

For DPO, the primary evaluation metric is reward accuracy:

Reward Accuracy = 𝔼_{(y_w, y_l)} [ 𝟏[ r(y_w) > r(y_l) ] ]

where r(y) = β log( π_θ(y|x) / π_ref(y|x) ) is the implicit reward. A reward accuracy above 50% (chance level for paired comparisons) indicates the model is learning to prefer chosen over rejected responses.
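The metric above reduces to a comparison of log-probability differences. A minimal pure-Python sketch, assuming per-example sequence log probabilities are already available; the function names and batch layout are illustrative:

```python
def implicit_reward(policy_logp, ref_logp, beta=0.1):
    # r(y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    #      = beta * (log pi_theta(y|x) - log pi_ref(y|x))
    return beta * (policy_logp - ref_logp)

def reward_accuracy(batch, beta=0.1):
    """Fraction of pairs where the chosen response gets the higher implicit reward.

    batch: list of tuples
      (policy_logp_chosen, ref_logp_chosen, policy_logp_rejected, ref_logp_rejected)
    """
    correct = 0
    for pc, rc, pr, rr in batch:
        r_w = implicit_reward(pc, rc, beta)  # implicit reward of chosen y_w
        r_l = implicit_reward(pr, rr, beta)  # implicit reward of rejected y_l
        correct += r_w > r_l
    return correct / len(batch)
```

Note that β scales both rewards equally, so it does not affect the accuracy itself; it does affect the reward margin, which is typically logged alongside.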

Pseudo-code:

# Abstract evaluation loop (NOT actual implementation)
for eval_batch in eval_data:
    loss, metrics = compute_metrics(model, eval_batch)   # same loss/metrics as training
    log_metrics(metrics)                                 # e.g. to wandb
if sample_during_eval:
    for prompt in eval_prompts:
        policy_sample = model.generate(prompt)           # current policy
        reference_sample = reference.generate(prompt)    # frozen reference model
        log_samples(policy_sample, reference_sample)

Related Pages

Implemented By

Uses Heuristic
