
Implementation:Openai Evals RecorderBase Record Final Report

From Leeroopedia
Knowledge Sources
Domains: Evaluation, Logging
Last Updated: 2026-02-14 10:00 GMT

Overview

A concrete tool for recording evaluation events and final reports, provided by the RecorderBase class hierarchy in the evals library.

Description

The RecorderBase class provides thread-safe event recording with multiple concrete backends. It maintains an in-memory event list with periodic flushing, supports context-managed sample ID tracking, and offers typed recording methods for common event types. LocalRecorder writes JSONL to local files, HttpRecorder posts events to an HTTP endpoint, Recorder writes to Snowflake, and DummyRecorder logs to console. The record_final_report method persists the aggregated metrics dict at the end of an eval run.
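The in-memory event list with periodic flushing described above can be sketched with a minimal, self-contained recorder. This is an illustration of the pattern, not the actual evals implementation; the class name MiniRecorder and its parameters are hypothetical:

```python
import json
import threading
from typing import Any, Optional

class MiniRecorder:
    """Minimal sketch of a RecorderBase-style recorder: keeps events in
    memory and flushes them to a JSONL file once a batch accumulates."""

    def __init__(self, log_path: str, flush_every: int = 2):
        self.log_path = log_path
        self.flush_every = flush_every
        self._events: list[dict] = []
        self._flushed = 0
        self._lock = threading.Lock()  # guard the shared in-memory event list

    def record_event(self, type: str, data: Optional[dict] = None,
                     sample_id: Optional[str] = None) -> None:
        with self._lock:
            self._events.append(
                {"type": type, "data": data or {}, "sample_id": sample_id}
            )
            if len(self._events) - self._flushed >= self.flush_every:
                self._flush()

    def record_final_report(self, final_report: Any) -> None:
        # Persist the aggregated metrics dict at the end of the run.
        with self._lock:
            self._events.append({"type": "final_report", "data": final_report})
            self._flush()

    def _flush(self) -> None:
        # Append only the not-yet-written events as JSON lines.
        with open(self.log_path, "a") as f:
            for event in self._events[self._flushed:]:
                f.write(json.dumps(event) + "\n")
        self._flushed = len(self._events)
```

The lock makes concurrent record_event calls from multiple eval threads safe, and the flush threshold trades write frequency against data loss on a crash.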

Usage

Use RecorderBase methods throughout eval execution to log results. The recorder is passed to Eval.run(), and individual eval_sample calls record match/sampling/metrics events through the default recorder, which is set via a context variable.

Code Reference

Source Location

  • Repository: openai/evals
  • File: evals/record.py (lines 54-264 for RecorderBase, full file 642 lines)

Signature

class RecorderBase:
    def __init__(self, run_spec: RunSpec) -> None:
        """Initialize recorder with run specification."""

    def record_event(self, type: str, data: dict | None = None, sample_id: str | None = None) -> None:
        """Record a generic event."""

    def record_match(self, correct: bool, *, expected=None, picked=None, sample_id=None, **extra) -> None:
        """Record a match/non-match result."""

    def record_sampling(self, prompt, sampled, sample_id=None, **extra) -> None:
        """Record a model sampling event."""

    def record_metrics(self, **kwargs) -> None:
        """Record arbitrary metrics as key-value pairs."""

    def record_final_report(self, final_report: Any) -> None:
        """Record the final aggregated report."""

    def get_events(self, type: str) -> Sequence[Event]:
        """Retrieve all events of a given type."""

    def get_metrics(self) -> list[dict]:
        """Retrieve all recorded metrics."""

    @contextlib.contextmanager
    def as_default_recorder(self, sample_id: str):
        """Context manager that sets this recorder as default for a sample."""

Import

from evals.record import RecorderBase, LocalRecorder, DummyRecorder, Event

I/O Contract

Inputs

Name          Type     Required                        Description
run_spec      RunSpec  Yes                             Run specification with run_id, eval_name, etc.
type          str      Yes (for record_event)          Event type: "match", "sampling", "metrics", "error", etc.
data          dict     No                              Event data payload
final_report  Any      Yes (for record_final_report)   Aggregated metrics dictionary

Outputs

Name               Type             Description
Events             list[Event]      In-memory list of recorded Event dataclass instances
Log file           JSONL file       Persisted events in JSON lines format (LocalRecorder)
get_events return  Sequence[Event]  Filtered events by type
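To make the JSONL output concrete, here is a sketch of reading a LocalRecorder-style log back and filtering events by type, mirroring what get_events and get_metrics return. It assumes only the one-JSON-object-per-line format described above; the helper names are illustrative, not part of the evals API:

```python
import json
from typing import Iterator

def iter_events(log_path: str, type: str) -> Iterator[dict]:
    """Yield events of a given type from a JSONL log file."""
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") == type:
                yield event

def get_metrics(log_path: str) -> list[dict]:
    """Collect the data payloads of all "metrics" events, like get_metrics."""
    return [event["data"] for event in iter_events(log_path, "metrics")]
```

Because each line is an independent JSON object, a partially written log from an interrupted run can still be read up to the last complete line.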

Usage Examples

Using Recorder in an Eval

from evals.record import LocalRecorder
from evals.base import RunSpec

run_spec = RunSpec(
    completion_fns=["gpt-3.5-turbo"],
    eval_name="my-eval.dev",
    base_eval="my-eval",
    split="dev",
    run_config={},
    created_by="user",
)

recorder = LocalRecorder("/tmp/evallogs/my-run.jsonl", run_spec=run_spec)

# Record events during evaluation
with recorder.as_default_recorder("sample_0"):
    recorder.record_match(correct=True, expected="Paris", picked="Paris")
    recorder.record_sampling(prompt="What is the capital of France?", sampled="Paris")
    recorder.record_metrics(accuracy=1.0)

# Record final report
recorder.record_final_report({"accuracy": 0.95, "bootstrap_std": 0.02})

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
