Implementation:Openai Evals Get Accuracy

From Leeroopedia
Revision as of 13:34, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Openai_Evals_Get_Accuracy.md)
Knowledge Sources
Domains: Evaluation, Statistics
Last Updated: 2026-02-14 10:00 GMT

Overview

Concrete tool, provided by the evals.metrics module, for computing standard evaluation metrics from recorded events.

Description

The evals.metrics module provides functions for computing accuracy, bootstrap standard deviation, confusion matrices, precision, recall, F-score, and the Matthews correlation coefficient from sequences of Event objects. get_accuracy is the most commonly used function; it computes the fraction of events whose data["correct"] field is True. These functions are typically called in the run method of Eval subclasses, after eval_all_samples completes.
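
As a rough illustration of what get_accuracy computes, the sketch below reimplements the described behaviour on a stand-in event type. FakeEvent and accuracy_sketch are hypothetical names introduced here, not part of evals. The only assumptions taken from this page are that each event carries data["correct"] and that an empty sequence yields NaN.

```python
import math
from dataclasses import dataclass, field
from typing import Any, Dict, Sequence


@dataclass
class FakeEvent:
    """Hypothetical stand-in for an evals Event; only the `data` dict is used."""

    data: Dict[str, Any] = field(default_factory=dict)


def accuracy_sketch(events: Sequence[FakeEvent]) -> float:
    """Fraction of events whose data["correct"] is truthy; NaN if there are none."""
    if len(events) == 0:
        return float("nan")
    num_correct = sum(int(bool(e.data["correct"])) for e in events)
    return num_correct / len(events)


events = [FakeEvent({"correct": c}) for c in (True, True, False, True)]
print(accuracy_sketch(events))  # 3 correct out of 4 events
print(math.isnan(accuracy_sketch([])))  # empty input gives NaN
```

Because NaN compares unequal to everything, callers that aggregate results across runs may want to check for it with math.isnan before averaging.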

Usage

Call these functions in custom eval run methods to aggregate match events into summary metrics. Pass the events from recorder.get_events("match").

Code Reference

Source Location

  • Repository: openai/evals
  • File: evals/metrics.py (lines 12-73)

Signature

def get_accuracy(events: Sequence[Event]) -> float:
    """Compute accuracy as fraction of correct events. Returns NaN if no events."""

def get_bootstrap_accuracy_std(events: Sequence[Event], num_samples: int = 1000) -> float:
    """Compute bootstrap standard deviation of accuracy."""

def get_confusion_matrix(
    matches: Sequence[Event],
    class_labels: Optional[Set] = None,
) -> np.ndarray:
    """Build confusion matrix from match events. Returns N×(N+1) array."""

def compute_precision(confusion_matrix: np.ndarray, idx: int = 0) -> float:
    """Compute precision for a given class index."""

def compute_recall(confusion_matrix: np.ndarray, idx: int = 0) -> float:
    """Compute recall for a given class index."""

def compute_f_score(
    confusion_matrix: np.ndarray, idx: int = 0, beta: float = 1.0
) -> float:
    """Compute F-score for a given class index."""

def compute_averaged_f_score(
    confusion_matrix: np.ndarray, beta: float = 1.0, average: str = "macro"
) -> float:
    """Compute macro-averaged F-score across all classes."""
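
For reference, the textbook definitions of these quantities for class $i$ are given below. They assume that rows of the confusion matrix $C$ index the expected class and columns index the picked class; this page does not state the orientation used by evals.metrics, so treat that as an assumption. The sum over $k$ in $P_i$ runs over the $N$ rows, and the sum in $R_i$ runs over all $N+1$ columns.

```latex
\begin{align}
P_i &= \frac{C_{ii}}{\sum_{k} C_{ki}}, \qquad
R_i = \frac{C_{ii}}{\sum_{k} C_{ik}}, \\
F_{\beta,i} &= \frac{(1+\beta^2)\, P_i R_i}{\beta^2 P_i + R_i}, \qquad
F_{\beta}^{\text{macro}} = \frac{1}{N} \sum_{i=1}^{N} F_{\beta,i}.
\end{align}
```

With $\beta = 1$ the per-class score reduces to the familiar F1, the harmonic mean of precision and recall.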

Import

from evals.metrics import (
    get_accuracy,
    get_bootstrap_accuracy_std,
    get_confusion_matrix,
    compute_precision,
    compute_recall,
    compute_f_score,
)

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| events | Sequence[Event] | Yes | Match events with data["correct"] boolean field |
| num_samples | int | No | Bootstrap resampling count (default 1000) |
| class_labels | Optional[Set] | No | Expected class labels for confusion matrix |

Outputs

| Name | Type | Description |
|---|---|---|
| get_accuracy | float | Fraction of correct events (0.0 to 1.0, or NaN if empty) |
| get_bootstrap_accuracy_std | float | Standard deviation of bootstrap accuracy estimates |
| get_confusion_matrix | np.ndarray | N×(N+1) confusion matrix |
| compute_precision | float | Precision for specified class |
| compute_recall | float | Recall for specified class |
| compute_f_score | float | F-score for specified class |
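
The N×(N+1) shape can be puzzling at first. One plausible reading, sketched below in plain Python, is that each row is an expected class, the first N columns are the picked classes, and the extra column collects picks that fall outside the known labels. The function name, the dictionary-based events, and the "expected"/"picked" field names are assumptions made for this illustration and are not confirmed by this page.

```python
from typing import Any, Dict, List, Optional, Sequence


def confusion_matrix_sketch(
    matches: Sequence[Dict[str, Any]],
    class_labels: Optional[Sequence[Any]] = None,
) -> List[List[int]]:
    """Count (expected, picked) pairs into an N x (N + 1) nested list."""
    expected = {m["expected"] for m in matches}
    # Without explicit labels, derive them from the expected values, in sorted order.
    ordered = sorted(expected) if class_labels is None else list(class_labels)
    index = {label: i for i, label in enumerate(ordered)}
    n = len(index)
    matrix = [[0] * (n + 1) for _ in range(n)]
    for m in matches:
        row = index[m["expected"]]
        # Picks that are not a known label land in the extra last column.
        col = index.get(m["picked"], n)
        matrix[row][col] += 1
    return matrix


matches = [
    {"expected": "A", "picked": "A"},
    {"expected": "A", "picked": "B"},
    {"expected": "B", "picked": "B"},
    {"expected": "B", "picked": "C"},  # "C" is not an expected label
]
cm = confusion_matrix_sketch(matches)
print(cm)  # two rows (A, B), three columns (A, B, other)
```

Keeping the out-of-label picks in their own column means a row sum still equals the number of samples with that expected label.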

Usage Examples

Computing Accuracy in an Eval

import evals
import evals.metrics

# eval_sample (not shown) must be implemented and record a "match" event per sample.
class MyEval(evals.Eval):
    def run(self, recorder):
        samples = self.get_samples()
        self.eval_all_samples(recorder, samples)

        events = recorder.get_events("match")
        return {
            "accuracy": evals.metrics.get_accuracy(events),
            "bootstrap_std": evals.metrics.get_bootstrap_accuracy_std(events),
        }
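
To give a feel for what a bootstrap standard deviation of accuracy measures, here is a minimal sketch using classic resampling with replacement. It is not the library's implementation, and the exact resampling scheme used by get_bootstrap_accuracy_std may differ. The function name, the seed parameter, and the use of plain booleans instead of Event objects are assumptions made for this example.

```python
import random
import statistics
from typing import Optional, Sequence


def bootstrap_accuracy_std_sketch(
    correct_flags: Sequence[bool],
    num_samples: int = 1000,
    seed: Optional[int] = None,
) -> float:
    """Standard deviation of accuracy over resamples drawn with replacement."""
    n = len(correct_flags)
    if n == 0:
        return float("nan")
    rng = random.Random(seed)  # local generator so a seed makes runs repeatable
    accuracies = []
    for _ in range(num_samples):
        resample = rng.choices(correct_flags, k=n)
        accuracies.append(sum(resample) / n)
    return statistics.pstdev(accuracies)


flags = [True] * 70 + [False] * 30
spread = bootstrap_accuracy_std_sketch(flags, num_samples=1000, seed=0)
print(spread)  # roughly the binomial standard error for p = 0.7, n = 100
```

A small value suggests the accuracy estimate is stable for this sample size; a large one suggests the eval needs more samples before differences between models are meaningful.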

Computing Confusion Matrix

import evals.metrics

# `recorder` is the recorder passed to the eval's run method.
events = recorder.get_events("match")
cm = evals.metrics.get_confusion_matrix(events)
precision = evals.metrics.compute_precision(cm, idx=0)
recall = evals.metrics.compute_recall(cm, idx=0)
f1 = evals.metrics.compute_f_score(cm, idx=0, beta=1.0)
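
To make the numbers concrete, the sketch below works through precision, recall, and F-score on a small hand-written matrix using nested lists rather than np.ndarray. It assumes rows are expected classes and columns are picked classes, with a trailing column for out-of-label picks; the function names are hypothetical and only mirror the signatures listed on this page.

```python
import math
from typing import List

Matrix = List[List[int]]


def precision_sketch(cm: Matrix, idx: int = 0) -> float:
    """Correct picks of class idx divided by all picks of class idx (column sum)."""
    column_total = sum(row[idx] for row in cm)
    return cm[idx][idx] / column_total


def recall_sketch(cm: Matrix, idx: int = 0) -> float:
    """Correct picks of class idx divided by all samples expected to be idx (row sum)."""
    return cm[idx][idx] / sum(cm[idx])


def f_score_sketch(cm: Matrix, idx: int = 0, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall for class idx."""
    p = precision_sketch(cm, idx)
    r = recall_sketch(cm, idx)
    return (1 + beta**2) * p * r / (beta**2 * p + r)


def macro_f_score_sketch(cm: Matrix, beta: float = 1.0) -> float:
    """Unweighted mean of the per-class F-scores."""
    scores = [f_score_sketch(cm, i, beta) for i in range(len(cm))]
    return sum(scores) / len(scores)


cm = [
    [8, 2, 0],  # expected class 0: 8 picked as 0, 2 picked as 1
    [1, 9, 0],  # expected class 1: 1 picked as 0, 9 picked as 1
]
print(precision_sketch(cm, 0))  # 8 / 9
print(recall_sketch(cm, 0))  # 8 / 10
print(f_score_sketch(cm, 0))  # 16 / 19
print(macro_f_score_sketch(cm))  # mean of 16 / 19 and 6 / 7
```

This sketch does not guard against division by zero, which occurs when a class is never picked or never expected; real code may need to decide how to report such classes.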

Related Pages

Implements Principle
