Implementation: Allenai Open Instruct Human Eval App
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Web_Application |
| Last Updated | 2026-02-07 02:00 GMT |
Overview
Flask web application that serves a pairwise human evaluation interface for comparing model outputs, with user authentication, preference collection, and real-time summary statistics.
Description
The app.py module provides the complete human evaluation workflow for the open-instruct project. It uses Flask with SQLAlchemy (SQLite backend) and Flask-Login for authentication. Two database models are defined: User for evaluator accounts and EvaluationRecord for storing evaluation data, including the prompt, both model completions, acceptability judgments, preference rankings, and optional quality feedback. On startup, the app loads comparison instances from a JSONL file. The API randomizes completion order to prevent position bias, tracks each evaluator's progress, and computes live analytics including acceptance rates, win rates, and inter-annotator agreement.
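For orientation, one line of the input JSONL might look like the sketch below. The field names are assumptions for illustration, not the project's confirmed schema:
import json

# Hypothetical instance layout; field names are assumptions, not the
# confirmed open-instruct schema.
instance = {
    "id": "instance-0001",
    "prompt": "Explain the difference between a list and a tuple in Python.",
    "model_a": "model-x",
    "completion_a": "A list is mutable; ...",
    "model_b": "model-y",
    "completion_b": "A tuple is immutable; ...",
}

# Each line of the JSONL file holds one comparison instance.
with open("comparison_instances.jsonl") as f:
    instances = [json.loads(line) for line in f]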
Usage
Use this application when conducting human evaluation studies that compare model outputs. It supports multiple evaluators and tracks inter-annotator agreement, which helps produce reliable human preference data.
Code Reference
Source Location
- Repository: Allenai_Open_instruct
- File: human_eval/app.py
- Lines: 1-405
Signature
class User(UserMixin, db.Model):
    id = db.Column(db.Integer, primary_key=True)
    username = db.Column(db.String(100), unique=True)
    password = db.Column(db.String(200))

class EvaluationRecord(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    instance_index = db.Column(db.Integer)
    instance_id = db.Column(db.String(200))
    prompt = db.Column(db.String(10000))
    model_a = db.Column(db.String(200))
    model_b = db.Column(db.String(200))
    completion_a = db.Column(db.String(10000))
    completion_b = db.Column(db.String(10000))
    completion_a_is_acceptable = db.Column(db.String(50))
    completion_b_is_acceptable = db.Column(db.String(50))
    preference = db.Column(db.String(50))
    evaluator = db.Column(db.String(100))
    timestamp = db.Column(db.String(100))
# Key routes:
# @app.route("/login") - Authentication
# @app.route("/instances/<int:index>") - Serve evaluation instance
# @app.route("/api/model-outputs/<int:index>") - API: shuffled completions
# @app.route("/api/submit-evaluation") - Submit preference judgment
# @app.route("/summary") - Live analytics dashboard
Import
# Run directly as a Flask application:
# cd human_eval && flask run
# Or: python human_eval/app.py
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| JSONL data file | File | Yes | Comparison instances with prompts and model completions |
| User credentials | Form data | Yes | Username and password for evaluator authentication |
| Evaluation judgments | JSON POST | Yes | Acceptability and preference ratings per instance |
Outputs
| Name | Type | Description |
|---|---|---|
| evaluation.db | SQLite | Persisted evaluation records with all judgments |
| /summary JSON | API Response | Live acceptance rates, win rates, and agreement metrics |
| /api/model-outputs | API Response | Shuffled completions for position-bias-free evaluation |
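Because positions are shuffled per record, the /summary metrics must credit judgments to model names rather than positions. A hedged sketch of how win rates and a simple inter-annotator agreement proxy could be computed over EvaluationRecord rows (not necessarily the exact formulas in app.py):
from collections import Counter, defaultdict

def preferred_model(record):
    # Map the positional preference label back to a model name using the
    # per-record model_a/model_b assignment; because positions are
    # shuffled, the label alone does not identify a model.
    if record["preference"].startswith("a-is"):
        return record["model_a"]
    if record["preference"].startswith("b-is"):
        return record["model_b"]
    return "tie"

def summarize(records):
    # records: list of dicts mirroring EvaluationRecord rows.
    if not records:
        return {}
    wins = Counter(preferred_model(r) for r in records)
    win_rates = {model: count / len(records) for model, count in wins.items()}

    # Naive agreement proxy: fraction of judgment pairs on the same
    # instance that picked the same model.
    by_instance = defaultdict(list)
    for r in records:
        by_instance[r["instance_index"]].append(preferred_model(r))
    matches = pairs = 0
    for picks in by_instance.values():
        for i in range(len(picks)):
            for j in range(i + 1, len(picks)):
                pairs += 1
                matches += picks[i] == picks[j]
    agreement = matches / pairs if pairs else None
    return {"win_rates": win_rates, "agreement": agreement}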
Usage Examples
Running the Evaluation Server
# Start the human evaluation web interface
cd human_eval
flask run --port 5000
# Or with debug mode:
python app.py
Accessing the API
import requests
# Get model outputs for instance 0 (completions are shuffled)
response = requests.get("http://localhost:5000/api/model-outputs/0")
data = response.json()
# data contains: prompt, completion_a, completion_b (order randomized)
# Submit an evaluation
requests.post("http://localhost:5000/api/submit-evaluation", json={
"instance_index": 0,
"completion_a_is_acceptable": "yes",
"completion_b_is_acceptable": "no",
"preference": "a-is-clearly-better",
})
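Authenticating a Scripted Session
Because the routes sit behind Flask-Login, scripted access typically needs an authenticated session first. A sketch assuming the /login route accepts standard form fields (the "username" and "password" field names are assumptions):
import requests

session = requests.Session()

# Log in once so subsequent calls carry the session cookie.
session.post("http://localhost:5000/login", data={
    "username": "annotator1",
    "password": "secret",
})

# Authenticated calls reuse the same session.
response = session.get("http://localhost:5000/api/model-outputs/0")
print(response.json())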