Implementation:Allenai Open instruct Human Eval Compute Metrics



Knowledge Sources
Domains: Evaluation, Statistics
Last Updated: 2026-02-07 02:00 GMT

Overview

A concrete tool for computing evaluation metrics and inter-annotator agreement from exported human evaluation annotation data.

Description

The compute_metrics.py module provides the quantitative analysis layer for human evaluation studies. It reads annotation data from Excel exports, deduplicates records by keeping only the latest annotation per evaluator per instance, and then processes results per model pair. get_acceptance_results calculates per-model acceptance rates and inter-annotator agreement on acceptability judgments. get_comparison_results computes win rates by categorizing preferences as "clearly better," "slightly better," or "tie," and supports both strict and relaxed agreement modes (where ties count as 0.5 agreement).
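
The deduplication step is easy to illustrate. Here is a minimal sketch, assuming each record exposes hypothetical evaluator, instance_id, and timestamp fields; the actual field names depend on the export schema:

def deduplicate(records):
    """Keep only the latest annotation per (evaluator, instance) pair."""
    latest = {}
    for r in records:
        # Hypothetical field names; the most recent timestamp wins.
        key = (r.evaluator, r.instance_id)
        if key not in latest or r.timestamp > latest[key].timestamp:
            latest[key] = r
    return list(latest.values())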

Usage

Use this module after exporting evaluation data from the human eval app to compute publishable metrics. The agreement calculations establish the reliability of the human judgments.
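
As an illustration of the relaxed agreement mode, the following is a sketch of one plausible scoring rule consistent with the description above, where a tie paired with a non-tie vote earns half agreement. The vote labels, and the assumption of exactly two annotators per instance, are illustrative rather than taken from the source:

def relaxed_agreement(vote_pairs):
    """vote_pairs: one (vote1, vote2) tuple per instance, each vote one
    of "a", "b", or "tie" (illustrative labels)."""
    total = 0.0
    for v1, v2 in vote_pairs:
        if v1 == v2:
            total += 1.0          # full agreement
        elif "tie" in (v1, v2):
            total += 0.5          # tie vs. non-tie: partial credit
    return total / len(vote_pairs)

# Example: (1 + 0.5 + 0) / 3 = 0.5
relaxed_agreement([("a", "a"), ("a", "tie"), ("b", "a")])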

Code Reference

Source Location

human_eval/compute_metrics.py in the allenai/open-instruct repository

Signature

def get_acceptance_results(
    records: list,
    target_model_a: str,
    target_model_b: str
) -> dict:
    """Compute per-model acceptance rates and inter-annotator agreement.

    Returns dict with acceptance rates per model and agreement metrics
    including raw and relaxed agreement scores.
    """

def get_comparison_results(
    records: list,
    target_model_a: str,
    target_model_b: str
) -> dict:
    """Aggregate pairwise comparison preferences and compute win rates.

    Counts clearly/slightly better votes and calculates inter-annotator
    agreement with relaxed scoring (0.5 for ties).
    """

Import

from human_eval.compute_metrics import get_acceptance_results, get_comparison_results

I/O Contract

Inputs

Name Type Required Description
records list Yes Evaluation records (from DB or Excel export)
target_model_a str Yes Name of first model in comparison pair
target_model_b str Yes Name of second model in comparison pair

Outputs

Name Type Description
acceptance_results dict Per-model acceptance rates and agreement metrics
comparison_results dict Win rates (clear/slight), ties, and inter-annotator agreement
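
For orientation, the returned dictionaries might look as follows. The key names here are hypothetical, inferred from the usage example below rather than from the source:

# Hypothetical output shapes; actual keys depend on the module version.
acceptance_results = {
    "model_a_name": 0.82,   # acceptance rate for model A
    "model_b_name": 0.76,   # acceptance rate for model B
    "agreement": {"acceptance_agreement": 0.71},  # inter-annotator agreement
}
comparison_results = {
    "model_a_name_win_rate": 0.44,  # share of clearly/slightly-better votes
}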

Usage Examples

Computing Metrics from Exported Data

import pandas as pd
from human_eval.compute_metrics import get_acceptance_results, get_comparison_results

# Load exported evaluation data
df = pd.read_excel("evaluation_export.xlsx")

# Convert rows to lightweight record objects with attribute access;
# itertuples yields namedtuples whose fields match the column names
records = list(df.itertuples(index=False))

# Compute acceptance rates
acceptance = get_acceptance_results(records, "model_a_name", "model_b_name")
print(f"Model A acceptance: {acceptance['model_a_name']:.2%}")
print(f"Model B acceptance: {acceptance['model_b_name']:.2%}")
print(f"Agreement: {acceptance['agreement']['acceptance_agreement']:.2%}")

# Compute comparison win rates
comparison = get_comparison_results(records, "model_a_name", "model_b_name")
print(f"Model A win rate: {comparison['model_a_name_win_rate']:.2%}")
