Implementation: Allenai Open Instruct Human Eval Compute Metrics
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Statistics |
| Last Updated | 2026-02-07 02:00 GMT |
Overview
Concrete tool for computing evaluation metrics and inter-annotator agreement from exported human evaluation annotation data.
Description
The compute_metrics.py module provides the quantitative analysis layer for human evaluation studies. It reads annotation data from Excel exports, deduplicates records by keeping the latest annotation per evaluator per instance, then processes results per model pair. get_acceptance_results calculates per-model acceptance rates and inter-annotator agreement on acceptability judgments. get_comparison_results computes win rates by categorizing preferences into clearly better, slightly better, and tie, with both strict and relaxed agreement modes (where ties count as 0.5 agreement).
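The deduplication step described above can be sketched as follows. This is a minimal illustration, not the module's actual code; the record field names (`evaluator`, `instance_id`, `timestamp`) are assumptions about the annotation schema.

```python
from collections import namedtuple

# Hypothetical record shape; field names are assumptions, not the module's schema.
Record = namedtuple("Record", ["evaluator", "instance_id", "timestamp", "acceptable_a"])

def deduplicate(records):
    """Keep only the latest annotation per (evaluator, instance) pair."""
    latest = {}
    for r in records:
        key = (r.evaluator, r.instance_id)
        # A later timestamp for the same evaluator/instance replaces the earlier one.
        if key not in latest or r.timestamp > latest[key].timestamp:
            latest[key] = r
    return list(latest.values())
```

Running this before any rate calculation ensures each evaluator contributes at most one vote per instance.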
Usage
Use this module after exporting evaluation data from the human eval app to compute publishable metrics. The agreement calculations establish the reliability of the human judgments.
Code Reference
Source Location
- Repository: Allenai_Open_instruct
- File: human_eval/compute_metrics.py
- Lines: 1-206
Signature
def get_acceptance_results(
records: list,
target_model_a: str,
target_model_b: str
) -> dict:
"""Compute per-model acceptance rates and inter-annotator agreement.
Returns dict with acceptance rates per model and agreement metrics
including raw and relaxed agreement scores.
"""
def get_comparison_results(
records: list,
target_model_a: str,
target_model_b: str
) -> dict:
"""Aggregate pairwise comparison preferences and compute win rates.
Counts clearly/slightly better votes and calculates inter-annotator
agreement with relaxed scoring (0.5 for ties).
"""
Import
from human_eval.compute_metrics import get_acceptance_results, get_comparison_results
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| records | list | Yes | Evaluation records (from DB or Excel export) |
| target_model_a | str | Yes | Name of first model in comparison pair |
| target_model_b | str | Yes | Name of second model in comparison pair |
Outputs
| Name | Type | Description |
|---|---|---|
| acceptance_results | dict | Per-model acceptance rates and agreement metrics |
| comparison_results | dict | Win rates (clear/slight), ties, and inter-annotator agreement |
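The win-rate aggregation in the comparison output can be sketched as below. The bucket names and output key format are assumptions chosen for illustration; the actual dict keys returned by get_comparison_results may differ.

```python
from collections import Counter

def win_rates(preferences, model_a, model_b):
    """Aggregate 5-point preference labels into win rates.

    preferences: list of labels such as "clearly_a", "slightly_a",
    "tie", "slightly_b", "clearly_b" (illustrative label scheme).
    Key names in the returned dict are hypothetical.
    """
    counts = Counter(preferences)
    n = len(preferences)
    return {
        f"{model_a}_clear_win_rate": counts["clearly_a"] / n,
        f"{model_a}_win_rate": (counts["clearly_a"] + counts["slightly_a"]) / n,
        f"{model_b}_clear_win_rate": counts["clearly_b"] / n,
        f"{model_b}_win_rate": (counts["clearly_b"] + counts["slightly_b"]) / n,
        "tie_rate": counts["tie"] / n,
    }
```

Separating clear from slight wins lets a report state both a strict win rate (clear only) and an overall preference rate.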
Usage Examples
Computing Metrics from Exported Data
import pandas as pd
from human_eval.compute_metrics import get_acceptance_results, get_comparison_results
# Load exported evaluation data
df = pd.read_excel("evaluation_export.xlsx")
# Convert rows to namedtuples with attribute access matching the record fields
records = list(df.itertuples(index=False))
# Compute acceptance rates
acceptance = get_acceptance_results(records, "model_a_name", "model_b_name")
print(f"Model A acceptance: {acceptance['model_a_name']:.2%}")
print(f"Model B acceptance: {acceptance['model_b_name']:.2%}")
print(f"Agreement: {acceptance['agreement']['acceptance_agreement']:.2%}")
# Compute comparison win rates
comparison = get_comparison_results(records, "model_a_name", "model_b_name")
print(f"Model A win rate: {comparison['model_a_name_win_rate']:.2%}")