Implementation: OpenAI Evals RecorderBase Record Final Report
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Logging |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
Concrete tooling, provided by the evals RecorderBase class hierarchy, for recording evaluation events and persisting the final report of a run.
Description
The RecorderBase class provides thread-safe event recording with multiple concrete backends. It maintains an in-memory event list with periodic flushing, supports context-managed sample ID tracking, and offers typed recording methods for common event types. LocalRecorder writes JSONL to local files, HttpRecorder posts events to an HTTP endpoint, Recorder writes to Snowflake, and DummyRecorder logs to console. The record_final_report method persists the aggregated metrics dict at the end of an eval run.
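To make the event model concrete, here is the approximate shape of the records the typed helpers emit. This is an illustrative sketch: the envelope fields (run_id, event_id, sample_id, type, data, created_at) are assumptions based on the JSONL that LocalRecorder writes, not a guaranteed schema.

# Illustrative only: approximate envelope produced by record_match.
# Field names are assumptions drawn from typical LocalRecorder JSONL output.
match_event = {
    "run_id": "240214000000ABCDEF",   # hypothetical run identifier
    "event_id": 0,
    "sample_id": "my-eval.dev.0",     # hypothetical sample identifier
    "type": "match",                  # set by record_match
    "data": {"correct": True, "expected": "Paris", "picked": "Paris"},
    "created_at": "2026-02-14 10:00:00+00:00",
}
# record_sampling and record_metrics use the same envelope, with
# type "sampling" / "metrics" and their keyword arguments folded into "data".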
Usage
Use the RecorderBase methods throughout eval execution to log results. The recorder is passed to Eval.run(), and individual eval_sample calls pick up the default recorder through a context variable to record match, sampling, and metrics events, as sketched below.
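A minimal sketch of that flow, assuming a hypothetical per-sample function (eval_sample, the sample dict keys, and the stand-in model call are all illustrative; only as_default_recorder and the typed record_* methods come from the documented API):

def eval_sample(recorder, sample_id, sample):
    # Hypothetical per-sample worker. Everything recorded inside the
    # context manager is tagged with sample_id via the context variable.
    with recorder.as_default_recorder(sample_id):
        picked = "Paris"  # stand-in for a real model completion call
        recorder.record_sampling(prompt=sample["input"], sampled=picked)
        recorder.record_match(
            correct=(picked == sample["ideal"]),
            expected=sample["ideal"],
            picked=picked,
        )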
Code Reference
Source Location
- Repository: openai/evals
- File: evals/record.py (lines 54-264 for RecorderBase, full file 642 lines)
Signature
class RecorderBase:
    def __init__(self, run_spec: RunSpec) -> None:
        """Initialize recorder with run specification."""

    def record_event(self, type: str, data: dict = None, sample_id: str = None) -> None:
        """Record a generic event."""

    def record_match(self, correct: bool, *, expected=None, picked=None, sample_id=None, **extra) -> None:
        """Record a match/non-match result."""

    def record_sampling(self, prompt, sampled, sample_id=None, **extra) -> None:
        """Record a model sampling event."""

    def record_metrics(self, **kwargs) -> None:
        """Record arbitrary metrics as key-value pairs."""

    def record_final_report(self, final_report: Any) -> None:
        """Record the final aggregated report."""

    def get_events(self, type: str) -> Sequence[Event]:
        """Retrieve all events of a given type."""

    def get_metrics(self) -> list[dict]:
        """Retrieve all recorded metrics."""

    @contextlib.contextmanager
    def as_default_recorder(self, sample_id: str):
        """Context manager that sets this recorder as default for a sample."""
Import
from evals.record import RecorderBase, LocalRecorder, HttpRecorder, Recorder, DummyRecorder, Event
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| run_spec | RunSpec | Yes | Run specification with run_id, eval_name, etc. |
| type | str | Yes (for record_event) | Event type: "match", "sampling", "metrics", "error", etc. |
| data | dict | No | Event data payload |
| final_report | Any | Yes (for record_final_report) | Aggregated metrics dictionary |
Outputs
| Name | Type | Description |
|---|---|---|
| Events | list[Event] | In-memory list of recorded Event dataclass instances |
| Log file | JSONL file | Persisted events in JSON lines format (LocalRecorder) |
| get_events return | Sequence[Event] | Filtered events by type |
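Given the LocalRecorder output above, a run log can be post-processed offline. A minimal sketch, assuming the conventional JSONL layout in which the first line carries the run spec and each later line is one event object with "type" and "data" fields (verify against an actual log before relying on this):

import json

def load_match_events(log_path: str) -> list[dict]:
    # Collect the "match" events from a LocalRecorder JSONL log.
    events = []
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if record.get("type") == "match":  # skips the spec line and other event types
                events.append(record)
    return events

matches = load_match_events("/tmp/evallogs/my-run.jsonl")
accuracy = sum(e["data"]["correct"] for e in matches) / max(len(matches), 1)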
Usage Examples
Using Recorder in an Eval
from evals.record import LocalRecorder
from evals.base import RunSpec

run_spec = RunSpec(
    completion_fns=["gpt-3.5-turbo"],
    eval_name="my-eval.dev",
    base_eval="my-eval",
    split="dev",
    run_config={},
    created_by="user",
)
recorder = LocalRecorder("/tmp/evallogs/my-run.jsonl", run_spec=run_spec)

# Record events during evaluation
with recorder.as_default_recorder("sample_0"):
    recorder.record_match(correct=True, expected="Paris", picked="Paris")
    recorder.record_sampling(prompt="What is the capital of France?", sampled="Paris")
    recorder.record_metrics(accuracy=1.0)

# Record final report
recorder.record_final_report({"accuracy": 0.95, "bootstrap_std": 0.02})
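After record_final_report, the in-memory accessors from the signature above can confirm what was captured; the expected values in the comments assume the metrics kwargs land in the event data unchanged:

# Inspect what was recorded, straight from the in-memory event list.
match_events = recorder.get_events("match")  # Sequence[Event] of type "match"
print(len(match_events))                     # 1: the single match recorded above
print(recorder.get_metrics())                # expected: [{'accuracy': 1.0}]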