Implementation:Openai Evals Get Accuracy
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Statistics |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
Concrete tool, provided by the evals.metrics module, for computing standard evaluation metrics from recorded events.
Description
The evals.metrics module provides functions for computing accuracy, the bootstrap standard deviation of accuracy, confusion matrices, precision, recall, F-score, and the Matthews correlation coefficient. Accuracy and the confusion matrix are computed directly from sequences of Event objects; precision, recall, and F-score are computed from the resulting confusion matrix. get_accuracy is the most commonly used function: it returns the fraction of events with correct=True. These functions are typically called in the run method of an Eval subclass, after eval_all_samples completes.
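As a minimal sketch of the accuracy computation described above, the snippet below reimplements it on a stand-in event type. `FakeEvent` and `accuracy_sketch` are hypothetical names used only for illustration; the real function takes the library's Event objects, and its internals may differ.

```python
import math
from dataclasses import dataclass, field
from typing import Any, Dict, Sequence


@dataclass
class FakeEvent:
    """Hypothetical stand-in for an evals Event; only the data dict is modeled."""
    data: Dict[str, Any] = field(default_factory=dict)


def accuracy_sketch(events: Sequence[FakeEvent]) -> float:
    """Fraction of events whose data["correct"] is truthy; NaN if there are none."""
    if len(events) == 0:
        return float("nan")
    num_correct = sum(1 for event in events if event.data["correct"])
    return num_correct / len(events)


events = [
    FakeEvent({"correct": True}),
    FakeEvent({"correct": False}),
    FakeEvent({"correct": True}),
    FakeEvent({"correct": True}),
]
print(accuracy_sketch(events))          # 0.75
print(math.isnan(accuracy_sketch([])))  # True
```

Returning NaN for an empty sequence, rather than raising, lets a run with no recorded match events still produce a report; callers that compare or average the value should check for NaN first.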
Usage
Call these functions in custom eval run methods to aggregate match events into summary metrics. Pass the events from recorder.get_events("match").
Code Reference
Source Location
- Repository: openai/evals
- File: evals/metrics.py (lines 12-73)
Signature
def get_accuracy(events: Sequence[Event]) -> float:
    """Compute accuracy as fraction of correct events. Returns NaN if no events."""

def get_bootstrap_accuracy_std(events: Sequence[Event], num_samples: int = 1000) -> float:
    """Compute bootstrap standard deviation of accuracy."""

def get_confusion_matrix(
    matches: Sequence[Event],
    class_labels: Optional[Set] = None,
) -> np.ndarray:
    """Build confusion matrix from match events. Returns N×(N+1) array."""

def compute_precision(confusion_matrix: np.ndarray, idx: int = 0) -> float:
    """Compute precision for a given class index."""

def compute_recall(confusion_matrix: np.ndarray, idx: int = 0) -> float:
    """Compute recall for a given class index."""

def compute_f_score(
    confusion_matrix: np.ndarray, idx: int = 0, beta: float = 1.0
) -> float:
    """Compute F-score for a given class index."""

def compute_averaged_f_score(
    confusion_matrix: np.ndarray, beta: float = 1.0, average: str = "macro"
) -> float:
    """Compute macro-averaged F-score across all classes."""
Import
from evals.metrics import (
    get_accuracy,
    get_bootstrap_accuracy_std,
    get_confusion_matrix,
    compute_precision,
    compute_recall,
    compute_f_score,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
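| events | Sequence[Event] | Yes | Match events with a boolean data["correct"] field (the parameter is named matches in get_confusion_matrix) |
| num_samples | int | No | Bootstrap resampling count (default 1000) |
| class_labels | Optional[Set] | No | Expected class labels for the confusion matrix (default None) |
| confusion_matrix | np.ndarray | Yes | N×(N+1) matrix returned by get_confusion_matrix; input to the compute_* functions |
| idx | int | No | Class index to score (default 0) |
| beta | float | No | F-score weight of recall relative to precision (default 1.0) |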
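To illustrate how class_labels relates to the N×(N+1) shape, here is a hedged pure-Python sketch of building such a matrix. It assumes each record carries "expected" and "picked" keys and that the extra last column collects picks outside the known labels; neither detail is stated on this page, so treat both as assumptions. `confusion_matrix_sketch` is a hypothetical name, and it returns nested lists rather than an np.ndarray.

```python
from typing import Any, Dict, Iterable, List, Optional, Sequence


def confusion_matrix_sketch(
    records: Sequence[Dict[str, Any]],
    class_labels: Optional[Iterable[Any]] = None,
) -> List[List[int]]:
    """Count (expected, picked) pairs into an N x (N+1) table.

    ASSUMPTION: each record has "expected" and "picked" keys; the real
    Event.data layout is not specified in this document.
    """
    seen = {rec["expected"] for rec in records}
    # Without explicit labels, derive them from the expected values seen.
    labels = sorted(seen) if class_labels is None else list(class_labels)
    index = {label: i for i, label in enumerate(labels)}
    n = len(index)
    matrix = [[0] * (n + 1) for _ in range(n)]
    for rec in records:
        row = index[rec["expected"]]
        col = index.get(rec["picked"], n)  # unknown picks go to the extra column
        matrix[row][col] += 1
    return matrix


records = [
    {"expected": "A", "picked": "A"},
    {"expected": "A", "picked": "B"},
    {"expected": "B", "picked": "B"},
    {"expected": "B", "picked": "maybe"},  # not a known label
]
print(confusion_matrix_sketch(records))  # [[1, 1, 0], [0, 1, 1]]
```

Passing explicit labels, for example `confusion_matrix_sketch(records, ["A", "B", "C"])`, keeps the matrix shape stable even when a class never appears among the expected values in a given run.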
Outputs
| Name | Type | Description |
|---|---|---|
| get_accuracy | float | Fraction of correct events (0.0 to 1.0, or NaN if empty) |
| get_bootstrap_accuracy_std | float | Standard deviation of bootstrap accuracy estimates |
| get_confusion_matrix | np.ndarray | N×(N+1) confusion matrix |
| compute_precision | float | Precision for specified class |
| compute_recall | float | Recall for specified class |
| compute_f_score | float | F-score for specified class |
| compute_averaged_f_score | float | Macro-averaged F-score across all classes |
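The bootstrap standard deviation gives a rough error bar on accuracy. The sketch below shows the textbook procedure (resample the correctness flags with replacement, recompute accuracy each time, take the standard deviation); the library's own resampling scheme may differ. `bootstrap_accuracy_std_sketch` and its `seed` parameter are hypothetical, added here for reproducibility.

```python
import random
import statistics
from typing import Sequence


def bootstrap_accuracy_std_sketch(
    correct: Sequence[bool], num_samples: int = 1000, seed: int = 0
) -> float:
    """Standard deviation of accuracy over resamples drawn with replacement."""
    rng = random.Random(seed)  # local generator so results are repeatable
    n = len(correct)
    accuracies = []
    for _ in range(num_samples):
        resample = rng.choices(correct, k=n)
        accuracies.append(sum(resample) / n)
    return statistics.pstdev(accuracies)


flags = [True] * 70 + [False] * 30
std = bootstrap_accuracy_std_sketch(flags)
print(f"accuracy = 0.70 +/- {std:.3f}")
```

For 100 samples at 70% accuracy the result should land near the binomial estimate sqrt(0.7 × 0.3 / 100) ≈ 0.046; a larger num_samples reduces the noise in the estimate, not the error bar itself.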
Usage Examples
Computing Accuracy in an Eval
import evals
import evals.metrics


class MyEval(evals.Eval):
    # eval_sample (not shown) is expected to record the match events.
    def run(self, recorder):
        samples = self.get_samples()
        self.eval_all_samples(recorder, samples)
        # Aggregate the recorded match events into summary metrics.
        events = recorder.get_events("match")
        return {
            "accuracy": evals.metrics.get_accuracy(events),
            "bootstrap_std": evals.metrics.get_bootstrap_accuracy_std(events),
        }
Computing Confusion Matrix
import evals.metrics

# recorder is the recorder passed to the eval's run method.
events = recorder.get_events("match")
cm = evals.metrics.get_confusion_matrix(events)

# Per-class metrics for the class at index 0.
precision = evals.metrics.compute_precision(cm, idx=0)
recall = evals.metrics.compute_recall(cm, idx=0)
f1 = evals.metrics.compute_f_score(cm, idx=0, beta=1.0)
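To make the per-class metrics concrete, here is a minimal pure-Python sketch on a small hand-written matrix. It assumes rows index the expected class, columns index the picked class, and the extra last column counts picks outside the known labels; that layout is an assumption consistent with the N×(N+1) shape, not something this page confirms. The `*_sketch` names are hypothetical and operate on nested lists rather than np.ndarray.

```python
from typing import List

Matrix = List[List[int]]


def precision_sketch(cm: Matrix, idx: int = 0) -> float:
    """Correct picks of class idx divided by all picks of class idx (column sum)."""
    return cm[idx][idx] / sum(row[idx] for row in cm)


def recall_sketch(cm: Matrix, idx: int = 0) -> float:
    """Correct picks of class idx divided by all samples expected to be idx (row sum)."""
    return cm[idx][idx] / sum(cm[idx])


def f_score_sketch(cm: Matrix, idx: int = 0, beta: float = 1.0) -> float:
    """F-beta: weighted harmonic mean of precision and recall."""
    p = precision_sketch(cm, idx)
    r = recall_sketch(cm, idx)
    return (1 + beta**2) * p * r / (beta**2 * p + r)


def macro_f_score_sketch(cm: Matrix, beta: float = 1.0) -> float:
    """Unweighted mean of the per-class F-scores."""
    scores = [f_score_sketch(cm, idx=i, beta=beta) for i in range(len(cm))]
    return sum(scores) / len(scores)


# Two classes, three columns (the last column is the assumed "other" bucket).
cm = [
    [8, 1, 1],  # expected class 0: 8 picked 0, 1 picked 1, 1 picked something else
    [2, 8, 0],  # expected class 1: 2 picked 0, 8 picked 1
]
print(precision_sketch(cm, 0))                # 0.8
print(recall_sketch(cm, 0))                   # 0.8
print(round(f_score_sketch(cm, 1), 3))        # 0.842
print(round(macro_f_score_sketch(cm), 3))     # 0.821
```

Note that a class that is never picked has a zero column sum, so precision is undefined for it; this sketch would raise ZeroDivisionError in that case, and callers may want to guard against it.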