Principle: OpenAI Evals Result Recording
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Logging |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
An event-based recording system that captures evaluation results, metrics, and metadata across multiple storage backends.
Description
Result Recording provides the infrastructure for persisting evaluation events during and after an eval run. The system defines a RecorderBase abstract class with concrete implementations for local JSON files (LocalRecorder), HTTP endpoints (HttpRecorder), Snowflake databases (Recorder), and no-op testing (DummyRecorder). Event recording is thread-safe and batched for efficiency, and events carry typed categories: match results, sampling data, metrics, embeddings, conditional log probabilities, and error reports. The final aggregated report is recorded separately via record_final_report.
Usage
Result recording is used in every evaluation run. The recorder is constructed by build_recorder based on CLI flags (--local-run, --http-run, --dry-run) and passed to Eval.run(), which uses it throughout sample evaluation.
Theoretical Basis
The recording system follows an append-only event log pattern:
- Events are appended to an in-memory list with thread locking
- Periodic flushing writes accumulated events to the configured backend
- A context manager (as_default_recorder) associates events with sample IDs
- Events are categorized by type: match, sampling, metrics, error, etc.
- The final report aggregates per-sample metrics into summary statistics
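The first four points above can be sketched together: a locked append-only list, a flush threshold, and a context manager that tags events with the current sample ID. This is an illustrative reimplementation of the pattern, assuming thread-local sample tracking; the real as_default_recorder in openai/evals differs in detail.

```python
import contextlib
import threading

_current = threading.local()  # per-thread sample ID, so concurrent samples don't collide

@contextlib.contextmanager
def as_default_recorder(sample_id):
    """Sketch: associate all events recorded inside the block with sample_id."""
    _current.sample_id = sample_id
    try:
        yield
    finally:
        _current.sample_id = None

class EventLog:
    """Append-only, thread-safe event log with batched flushing."""

    def __init__(self, flush_every=100):
        self._events = []
        self._flushed = []          # stand-in for the backend sink
        self._lock = threading.Lock()
        self._flush_every = flush_every

    def record(self, event_type, **data):
        event = {
            "type": event_type,
            "sample_id": getattr(_current, "sample_id", None),
            "data": data,
        }
        with self._lock:
            self._events.append(event)
            if len(self._events) >= self._flush_every:
                self._flush_locked()

    def _flush_locked(self):
        # Periodic flush: move the accumulated batch to the backend.
        self._flushed.extend(self._events)
        self._events.clear()
```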
Event types and their purposes:
- match — Records whether model output matched expected answer
- sampling — Records raw model completions with prompts
- metrics — Records arbitrary key-value metric pairs
- error — Records exceptions during evaluation
- raw_sample — Records unprocessed sample data
- embedding — Records vector embeddings
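Given events in these categories, the final report reduces per-sample results to summary statistics. A minimal sketch of that aggregation, assuming match events carry a boolean "correct" field (the actual record_final_report output format may differ):

```python
def final_report(events):
    """Sketch: aggregate per-sample match events into summary statistics."""
    matches = [e["data"]["correct"] for e in events if e["type"] == "match"]
    return {
        "accuracy": sum(matches) / len(matches) if matches else None,
        "n_samples": len(matches),
    }
```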