Principle: Eric Mitchell Direct Preference Optimization Evaluation and Sampling
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Training, NLP |
| Last Updated | 2026-02-08 02:00 GMT |
Overview
A training monitoring technique that periodically evaluates model performance on held-out data and generates text samples to qualitatively assess training progress.
Description
Evaluation and sampling provides quantitative and qualitative feedback during training:
- Quantitative evaluation: Computing the same loss and metrics used during training (reward accuracies, margins, log probabilities) on held-out evaluation data. For DPO, the key metric is reward accuracy: how often the model assigns a higher implicit reward to the chosen response than to the rejected response.
- Qualitative sampling: Generating text samples from both the policy and reference models given evaluation prompts. This allows human inspection of how the model's generation quality evolves during training.
- Logging: All metrics and samples are logged to Weights & Biases (wandb) for visualization and experiment tracking.
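As a minimal sketch of the logging step, the snippet below shows one common way to shape evaluation metrics before handing them to `wandb.log`: flattening them into a single dict keyed with an `eval/` prefix so they group together in the dashboard. The helper name and metric keys are illustrative, not taken from any particular implementation.

```python
def flatten_metrics(metrics, prefix="eval"):
    """Flatten a metrics dict into 'prefix/key' entries, a common
    wandb convention for grouping related charts."""
    return {f"{prefix}/{key}": value for key, value in metrics.items()}

metrics = {"loss": 0.48, "rewards/accuracy": 0.71, "rewards/margin": 1.9}
logged = flatten_metrics(metrics)
# A training script would then call e.g. wandb.log(logged),
# producing eval/loss, eval/rewards/accuracy, eval/rewards/margin.
```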
The evaluation is interleaved with training at configurable intervals (eval_every parameter, measured in training examples).
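To make the interleaving concrete, here is a small sketch of how an eval trigger measured in training examples (rather than steps) might work. The function and parameter names (`eval_steps`, `eval_every`, `batch_size`) are assumptions for illustration only.

```python
def eval_steps(total_examples, batch_size, eval_every):
    """Return the example counts at which evaluation would run,
    given eval_every measured in training examples."""
    triggers = []
    next_eval = 0
    seen = 0
    while seen < total_examples:
        if seen >= next_eval:        # time to pause training and evaluate
            triggers.append(seen)
            next_eval += eval_every
        seen += batch_size           # one training batch consumed
    return triggers

# With 100 examples, batch size 8, and eval_every=32:
print(eval_steps(100, 8, 32))  # → [0, 32, 64, 96]
```

Because the trigger compares example counts rather than step counts, the schedule stays stable if the batch size changes between runs.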
Usage
Use this principle when you need to monitor training progress. Evaluation frequency should balance information needs against compute cost, as evaluation pauses training.
Theoretical Basis
For DPO, the primary evaluation metric is reward accuracy:

$$\text{accuracy} = \mathbb{E}_{(x, y_w, y_l)}\left[\mathbb{1}\left(\hat{r}_\theta(x, y_w) > \hat{r}_\theta(x, y_l)\right)\right]$$

where $\hat{r}_\theta(x, y) = \beta \log \dfrac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ is the implicit reward. A reward accuracy above 50% indicates the model is learning to prefer chosen over rejected responses.
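The metric can be computed directly from per-sequence log probabilities under the policy and reference models. This is a sketch under an assumed interface; the argument names and `beta` default are illustrative.

```python
def reward_accuracy(policy_chosen_lp, policy_rejected_lp,
                    ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Fraction of preference pairs where the implicit reward of the
    chosen response exceeds that of the rejected response."""
    correct = 0
    for pc, pr, rc, rr in zip(policy_chosen_lp, policy_rejected_lp,
                              ref_chosen_lp, ref_rejected_lp):
        chosen_reward = beta * (pc - rc)    # beta * log(pi/pi_ref) for chosen
        rejected_reward = beta * (pr - rr)  # same for rejected
        correct += chosen_reward > rejected_reward
    return correct / len(policy_chosen_lp)
```

Note that since $\beta > 0$ merely rescales both rewards, accuracy itself does not depend on its value, though the logged reward margins do.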
Pseudo-code:

```python
# Abstract evaluation loop (NOT actual implementation)
for eval_batch in eval_data:
    loss, metrics = compute_metrics(model, eval_batch)
    log_metrics(metrics)

if sample_during_eval:
    for prompt in eval_prompts:
        policy_sample = model.generate(prompt)
        reference_sample = reference.generate(prompt)
        log_samples(policy_sample, reference_sample)
```