Implementation: Allenai Open Instruct Human Eval Compute Metrics
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Statistics |
| Last Updated | 2026-02-07 02:00 GMT |
Overview
Concrete tool for computing evaluation metrics and inter-annotator agreement from exported human evaluation annotation data.
Description
The compute_metrics.py module provides the quantitative analysis layer for human evaluation studies. It reads annotation data from Excel exports, deduplicates records by keeping the latest annotation per evaluator per instance, then processes results per model pair. get_acceptance_results calculates per-model acceptance rates and inter-annotator agreement on acceptability judgments. get_comparison_results computes win rates by categorizing preferences into clearly better, slightly better, and tie, with both strict and relaxed agreement modes (where ties count as 0.5 agreement).
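The deduplication step described above can be sketched as follows. This is a minimal illustration, not the module's actual code; the record field names (`evaluator`, `instance_id`, `timestamp`) are assumptions about the annotation schema.

```python
from collections import namedtuple

# Hypothetical record shape; field names are assumptions, not the module's schema.
Record = namedtuple("Record", ["evaluator", "instance_id", "timestamp", "acceptable_a"])

def deduplicate(records):
    """Keep only the latest annotation per (evaluator, instance) pair."""
    latest = {}
    for r in records:
        key = (r.evaluator, r.instance_id)
        # A later timestamp for the same evaluator/instance replaces the earlier one.
        if key not in latest or r.timestamp > latest[key].timestamp:
            latest[key] = r
    return list(latest.values())
```

Running this before any rate calculation ensures each evaluator contributes at most one vote per instance.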
Usage
Use this module after exporting evaluation data from the human eval app to compute publishable metrics. The agreement calculations establish the reliability of the human judgments.
Code Reference
Source Location
- Repository: Allenai_Open_instruct
- File: human_eval/compute_metrics.py
- Lines: 1-206
Signature
def get_acceptance_results(
records: list,
target_model_a: str,
target_model_b: str
) -> dict:
"""Compute per-model acceptance rates and inter-annotator agreement.
Returns dict with acceptance rates per model and agreement metrics
including raw and relaxed agreement scores.
"""
def get_comparison_results(
records: list,
target_model_a: str,
target_model_b: str
) -> dict:
"""Aggregate pairwise comparison preferences and compute win rates.
Counts clearly/slightly better votes and calculates inter-annotator
agreement with relaxed scoring (0.5 for ties).
"""
Import
from human_eval.compute_metrics import get_acceptance_results, get_comparison_results
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| records | list | Yes | Evaluation records (from DB or Excel export) |
| target_model_a | str | Yes | Name of first model in comparison pair |
| target_model_b | str | Yes | Name of second model in comparison pair |
Outputs
| Name | Type | Description |
|---|---|---|
| acceptance_results | dict | Per-model acceptance rates and agreement metrics |
| comparison_results | dict | Win rates (clear/slight), ties, and inter-annotator agreement |
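The win-rate aggregation in the comparison output can be sketched as below. The bucket names and output key format are assumptions chosen for illustration; the actual dict keys returned by get_comparison_results may differ.

```python
from collections import Counter

def win_rates(preferences, model_a, model_b):
    """Aggregate 5-point preference labels into win rates.

    preferences: list of labels such as "clearly_a", "slightly_a",
    "tie", "slightly_b", "clearly_b" (illustrative label scheme).
    Key names in the returned dict are hypothetical.
    """
    counts = Counter(preferences)
    n = len(preferences)
    return {
        f"{model_a}_clear_win_rate": counts["clearly_a"] / n,
        f"{model_a}_win_rate": (counts["clearly_a"] + counts["slightly_a"]) / n,
        f"{model_b}_clear_win_rate": counts["clearly_b"] / n,
        f"{model_b}_win_rate": (counts["clearly_b"] + counts["slightly_b"]) / n,
        "tie_rate": counts["tie"] / n,
    }
```

Separating clear from slight wins lets a report state both a strict win rate (clear only) and an overall preference rate.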
Usage Examples
Computing Metrics from Exported Data
import pandas as pd
from human_eval.compute_metrics import get_acceptance_results, get_comparison_results
# Load exported evaluation data
df = pd.read_excel("evaluation_export.xlsx")
# Convert rows to namedtuples with attribute access matching the record fields
records = list(df.itertuples(index=False))
# Compute acceptance rates
acceptance = get_acceptance_results(records, "model_a_name", "model_b_name")
print(f"Model A acceptance: {acceptance['model_a_name']:.2%}")
print(f"Model B acceptance: {acceptance['model_b_name']:.2%}")
print(f"Agreement: {acceptance['agreement']['acceptance_agreement']:.2%}")
# Compute comparison win rates
comparison = get_comparison_results(records, "model_a_name", "model_b_name")
print(f"Model A win rate: {comparison['model_a_name_win_rate']:.2%}")