Implementation: OpenAI Evals RecorderBase Record Final Report
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Logging |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
Concrete tooling, provided by the evals RecorderBase class hierarchy, for recording evaluation events and persisting the final report of a run.
Description
The RecorderBase class provides thread-safe event recording with multiple concrete backends. It maintains an in-memory event list with periodic flushing, supports context-managed sample ID tracking, and offers typed recording methods for common event types. LocalRecorder writes JSONL to local files, HttpRecorder posts events to an HTTP endpoint, Recorder writes to Snowflake, and DummyRecorder logs to console. The record_final_report method persists the aggregated metrics dict at the end of an eval run.
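To make the event model concrete, here is the approximate shape of the records the typed helpers emit. This is an illustrative sketch: the envelope fields (run_id, event_id, sample_id, type, data, created_at) are assumptions based on the JSONL that LocalRecorder writes, not a guaranteed schema.

# Illustrative only: approximate envelope produced by record_match.
# Field names are assumptions drawn from typical LocalRecorder JSONL output.
match_event = {
    "run_id": "240214000000ABCDEF",   # hypothetical run identifier
    "event_id": 0,
    "sample_id": "my-eval.dev.0",     # hypothetical sample identifier
    "type": "match",                  # set by record_match
    "data": {"correct": True, "expected": "Paris", "picked": "Paris"},
    "created_at": "2026-02-14 10:00:00+00:00",
}
# record_sampling and record_metrics use the same envelope, with
# type "sampling" / "metrics" and their keyword arguments folded into "data".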
Usage
Use the RecorderBase methods throughout eval execution to log results. The recorder is passed to Eval.run(), and individual eval_sample calls pick up the default recorder through a context variable to record match, sampling, and metrics events, as sketched below.
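A minimal sketch of that flow, assuming a hypothetical per-sample function (eval_sample, the sample dict keys, and the stand-in model call are all illustrative; only as_default_recorder and the typed record_* methods come from the documented API):

def eval_sample(recorder, sample_id, sample):
    # Hypothetical per-sample worker. Everything recorded inside the
    # context manager is tagged with sample_id via the context variable.
    with recorder.as_default_recorder(sample_id):
        picked = "Paris"  # stand-in for a real model completion call
        recorder.record_sampling(prompt=sample["input"], sampled=picked)
        recorder.record_match(
            correct=(picked == sample["ideal"]),
            expected=sample["ideal"],
            picked=picked,
        )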
Code Reference
Source Location
- Repository: openai/evals
- File: evals/record.py (lines 54-264 for RecorderBase, full file 642 lines)
Signature
class RecorderBase:
    def __init__(self, run_spec: RunSpec) -> None:
        """Initialize recorder with run specification."""

    def record_event(self, type: str, data: dict = None, sample_id: str = None) -> None:
        """Record a generic event."""

    def record_match(self, correct: bool, *, expected=None, picked=None, sample_id=None, **extra) -> None:
        """Record a match/non-match result."""

    def record_sampling(self, prompt, sampled, sample_id=None, **extra) -> None:
        """Record a model sampling event."""

    def record_metrics(self, **kwargs) -> None:
        """Record arbitrary metrics as key-value pairs."""

    def record_final_report(self, final_report: Any) -> None:
        """Record the final aggregated report."""

    def get_events(self, type: str) -> Sequence[Event]:
        """Retrieve all events of a given type."""

    def get_metrics(self) -> list[dict]:
        """Retrieve all recorded metrics."""

    @contextlib.contextmanager
    def as_default_recorder(self, sample_id: str):
        """Context manager that sets this recorder as default for a sample."""
Import
from evals.record import RecorderBase, LocalRecorder, HttpRecorder, Recorder, DummyRecorder, Event
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| run_spec | RunSpec | Yes | Run specification with run_id, eval_name, etc. |
| type | str | Yes (for record_event) | Event type: "match", "sampling", "metrics", "error", etc. |
| data | dict | No | Event data payload |
| final_report | Any | Yes (for record_final_report) | Aggregated metrics dictionary |
Outputs
| Name | Type | Description |
|---|---|---|
| Events | list[Event] | In-memory list of recorded Event dataclass instances |
| Log file | JSONL file | Persisted events in JSON lines format (LocalRecorder) |
| get_events return | Sequence[Event] | Filtered events by type |
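Given the LocalRecorder output above, a run log can be post-processed offline. A minimal sketch, assuming the conventional JSONL layout in which the first line carries the run spec and each later line is one event object with "type" and "data" fields (verify against an actual log before relying on this):

import json

def load_match_events(log_path: str) -> list[dict]:
    # Collect the "match" events from a LocalRecorder JSONL log.
    events = []
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if record.get("type") == "match":  # skips the spec line and other event types
                events.append(record)
    return events

matches = load_match_events("/tmp/evallogs/my-run.jsonl")
accuracy = sum(e["data"]["correct"] for e in matches) / max(len(matches), 1)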
Usage Examples
Using Recorder in an Eval
from evals.record import LocalRecorder
from evals.base import RunSpec

run_spec = RunSpec(
    completion_fns=["gpt-3.5-turbo"],
    eval_name="my-eval.dev",
    base_eval="my-eval",
    split="dev",
    run_config={},
    created_by="user",
)
recorder = LocalRecorder("/tmp/evallogs/my-run.jsonl", run_spec=run_spec)

# Record events during evaluation
with recorder.as_default_recorder("sample_0"):
    recorder.record_match(correct=True, expected="Paris", picked="Paris")
    recorder.record_sampling(prompt="What is the capital of France?", sampled="Paris")
    recorder.record_metrics(accuracy=1.0)

# Record final report
recorder.record_final_report({"accuracy": 0.95, "bootstrap_std": 0.02})
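After record_final_report, the in-memory accessors from the signature above can confirm what was captured; the expected values in the comments assume the metrics kwargs land in the event data unchanged:

# Inspect what was recorded, straight from the in-memory event list.
match_events = recorder.get_events("match")  # Sequence[Event] of type "match"
print(len(match_events))                     # 1: the single match recorded above
print(recorder.get_metrics())                # expected: [{'accuracy': 1.0}]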