Implementation:Openai Evals Get Accuracy

Knowledge Sources	OpenAI Evals
Domains	Evaluation, Statistics
Last Updated	2026-02-14 10:00 GMT

Overview

A concrete tool, provided by the evals.metrics module, for computing standard evaluation metrics from recorded events.

Description

The evals.metrics module provides functions for computing accuracy, the bootstrap standard deviation of accuracy, confusion matrices, precision, recall, F-score, and the Matthews correlation coefficient from sequences of Event objects. get_accuracy is the most commonly used function: it returns the fraction of events whose data["correct"] field is true, or NaN when there are no events. These functions are typically called in the run method of an Eval subclass after eval_all_samples completes.
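The accuracy computation described above can be sketched in a few lines. This is a minimal illustration, not the library's source: FakeEvent is a hypothetical stand-in for the real Event class that models only the data dictionary, and accuracy_sketch is an illustrative name.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Sequence


@dataclass
class FakeEvent:
    """Hypothetical stand-in for an evals Event; only `data` is modeled."""
    data: Dict[str, Any] = field(default_factory=dict)


def accuracy_sketch(events: Sequence[FakeEvent]) -> float:
    """Fraction of events whose data["correct"] is true; NaN when empty."""
    if len(events) == 0:
        return float("nan")
    num_correct = sum(int(e.data["correct"]) for e in events)
    return num_correct / len(events)


events = [FakeEvent({"correct": c}) for c in (True, False, True, True)]
print(accuracy_sketch(events))  # 0.75
print(accuracy_sketch([]))      # nan
```

Returning NaN for an empty sequence, rather than raising, lets a run that recorded no match events still produce a report.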

Usage

Call these functions from the run method of a custom eval to aggregate match events into summary metrics, passing them the events returned by recorder.get_events("match").

Code Reference

Source Location

Repository: openai/evals
File: evals/metrics.py (lines 12-73)

Signature

def get_accuracy(events: Sequence[Event]) -> float:
    """Compute accuracy as fraction of correct events. Returns NaN if no events."""

def get_bootstrap_accuracy_std(events: Sequence[Event], num_samples: int = 1000) -> float:
    """Compute bootstrap standard deviation of accuracy."""

def get_confusion_matrix(
    matches: Sequence[Event],
    class_labels: Optional[Set] = None,
) -> np.ndarray:
    """Build confusion matrix from match events. Returns N×(N+1) array."""

def compute_precision(confusion_matrix: np.ndarray, idx: int = 0) -> float:
    """Compute precision for a given class index."""

def compute_recall(confusion_matrix: np.ndarray, idx: int = 0) -> float:
    """Compute recall for a given class index."""

def compute_f_score(
    confusion_matrix: np.ndarray, idx: int = 0, beta: float = 1.0
) -> float:
    """Compute F-score for a given class index."""

def compute_averaged_f_score(
    confusion_matrix: np.ndarray, beta: float = 1.0, average: str = "macro"
) -> float:
    """Compute macro-averaged F-score across all classes."""

Import

from evals.metrics import (
    get_accuracy,
    get_bootstrap_accuracy_std,
    get_confusion_matrix,
    compute_precision,
    compute_recall,
    compute_f_score,
    compute_averaged_f_score,
)

I/O Contract

Inputs

Name	Type	Required	Description
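events	Sequence[Event]	Yes	Match events with data["correct"] boolean field (named matches in get_confusion_matrix)
num_samples	int	No	Bootstrap resampling count (default 1000)
class_labels	Optional[Set]	No	Expected class labels for confusion matrix
confusion_matrix	np.ndarray	Yes	Confusion matrix passed to the compute_* functions
idx	int	No	Class index to score (default 0)
beta	float	No	F-score beta weighting (default 1.0)
average	str	No	Averaging mode for compute_averaged_f_score (default "macro")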

Outputs

Name	Type	Description
get_accuracy	float	Fraction of correct events (0.0 to 1.0, or NaN if empty)
get_bootstrap_accuracy_std	float	Standard deviation of bootstrap accuracy estimates
get_confusion_matrix	np.ndarray	N×(N+1) confusion matrix
compute_precision	float	Precision for specified class
compute_recall	float	Recall for specified class
compute_f_score	float	F-score for specified class
compute_averaged_f_score	float	Macro-averaged F-score across all classes
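To make the per-class outputs above concrete, the sketch below computes precision, recall, F-score, and a macro average from a small N×(N+1) matrix using the textbook definitions. It assumes that rows are expected classes and columns are picked classes, with the extra last column holding picks outside the label set; that orientation is an assumption here, and the *_sketch names are illustrative rather than library functions. Plain nested lists stand in for np.ndarray.

```python
from typing import List

# Assumed layout: rows = expected class, columns = picked class,
# plus one extra column for picks that match no known label.
Matrix = List[List[int]]


def precision_sketch(cm: Matrix, idx: int = 0) -> float:
    """Correct picks of class idx over everything picked as idx (column sum)."""
    picked_as_idx = sum(row[idx] for row in cm)
    return cm[idx][idx] / picked_as_idx


def recall_sketch(cm: Matrix, idx: int = 0) -> float:
    """Correct picks of class idx over everything expected to be idx (row sum)."""
    return cm[idx][idx] / sum(cm[idx])


def f_score_sketch(cm: Matrix, idx: int = 0, beta: float = 1.0) -> float:
    """F-beta: weighted harmonic mean of precision and recall."""
    p = precision_sketch(cm, idx)
    r = recall_sketch(cm, idx)
    return (1 + beta**2) * p * r / (beta**2 * p + r)


def macro_f_score_sketch(cm: Matrix, beta: float = 1.0) -> float:
    """Unweighted mean of the per-class F-scores."""
    return sum(f_score_sketch(cm, i, beta) for i in range(len(cm))) / len(cm)


cm = [
    [8, 1, 1],  # expected class 0: 8 correct, 1 picked as class 1, 1 unknown pick
    [2, 6, 0],  # expected class 1: 2 picked as class 0, 6 correct
]
print(precision_sketch(cm, idx=0), recall_sketch(cm, idx=0))
print(f_score_sketch(cm, idx=1), macro_f_score_sketch(cm))
```

With beta above 1 the F-score leans toward recall, and with beta below 1 it leans toward precision. Note that these definitions divide by zero when a class is never picked or never expected, so callers may want to guard against empty rows or columns.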

Usage Examples

Computing Accuracy in an Eval

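import evals
import evals.metrics

class MyEval(evals.Eval):
    # eval_sample (not shown) is expected to record one "match" event per sample.
    def run(self, recorder):
        # Evaluate every sample; events are written to the recorder as a side effect.
        samples = self.get_samples()
        self.eval_all_samples(recorder, samples)

        # Aggregate the recorded match events into summary metrics.
        events = recorder.get_events("match")
        return {
            "accuracy": evals.metrics.get_accuracy(events),
            "bootstrap_std": evals.metrics.get_bootstrap_accuracy_std(events),
        }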
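For intuition about what the bootstrap_std value above measures, here is a minimal sketch of a textbook bootstrap: resample the correctness flags with replacement, compute accuracy on each resample, and take the standard deviation of those accuracies. The exact resampling scheme used by get_bootstrap_accuracy_std is not described on this page and may differ; the function name and the seed parameter below are illustrative assumptions.

```python
import random
import statistics
from typing import Sequence


def bootstrap_accuracy_std_sketch(
    correct: Sequence[bool], num_samples: int = 1000, seed: int = 0
) -> float:
    """Std. dev. of accuracy over resamples drawn with replacement."""
    if len(correct) == 0:
        return float("nan")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    vals = [int(c) for c in correct]
    accuracies = []
    for _ in range(num_samples):
        resample = rng.choices(vals, k=len(vals))  # sample with replacement
        accuracies.append(sum(resample) / len(vals))
    return statistics.pstdev(accuracies)


flags = [True] * 50 + [False] * 50
print(bootstrap_accuracy_std_sketch(flags))
```

For 100 samples at 50% accuracy, the printed value should land near sqrt(0.5 * 0.5 / 100) = 0.05. A larger num_samples gives a steadier estimate at proportionally higher cost.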

Computing Confusion Matrix

import evals.metrics

# `recorder` is the recorder passed to Eval.run.
events = recorder.get_events("match")
cm = evals.metrics.get_confusion_matrix(events)

# Per-class metrics; idx selects the class index in the matrix.
precision = evals.metrics.compute_precision(cm, idx=0)
recall = evals.metrics.compute_recall(cm, idx=0)
f1 = evals.metrics.compute_f_score(cm, idx=0, beta=1.0)
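The N×(N+1) shape of the confusion matrix can be illustrated with a small stand-alone sketch. It assumes each match event carries "expected" and "picked" values in its data (field names that this page does not document, so treat them as assumptions), and it uses plain dictionaries and nested lists in place of Event objects and np.ndarray. The extra column is assumed to count picks that fall outside the label set.

```python
from typing import Any, Dict, List, Optional, Sequence


def confusion_matrix_sketch(
    matches: Sequence[Dict[str, Any]],
    class_labels: Optional[Sequence[Any]] = None,
) -> List[List[int]]:
    """Count (expected, picked) pairs into an N x (N+1) table."""
    expected = {m["expected"] for m in matches}
    # Without explicit labels, derive them from the expected values.
    labels = sorted(expected) if class_labels is None else list(class_labels)
    index = {label: i for i, label in enumerate(labels)}
    n = len(index)
    cm = [[0] * (n + 1) for _ in range(n)]
    for m in matches:
        row = index[m["expected"]]
        col = index.get(m["picked"], n)  # unknown picks go to the last column
        cm[row][col] += 1
    return cm


matches = [
    {"expected": "A", "picked": "A"},
    {"expected": "A", "picked": "B"},
    {"expected": "B", "picked": "B"},
    {"expected": "B", "picked": "C"},  # "C" is not an expected label
]
print(confusion_matrix_sketch(matches))  # [[1, 1, 0], [0, 1, 1]]
```

Keeping the extra column means a pick that matches no known label still counts against recall for its expected class instead of being dropped.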

Related Pages

Implements Principle

Principle:Openai_Evals_Eval_Metrics
