Implementation: ContextualAI HALOs AlpacaEval CLI
| Knowledge Sources | |
|---|---|
| Domains | NLP, Evaluation |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
Concrete tool for running AlpacaEval instruction-following benchmarks using the alpaca_eval CLI with custom annotator configs.
Description
The HALOs framework provides custom AlpacaEval annotator configurations for GPT-4.1 and GPT-4.1-mini judges. The evaluation pipeline:
1. Generate model outputs using `train.sample` in alpacaeval mode (produces JSON with `instruction` and `output` fields).
2. Run `alpaca_eval evaluate` with the custom annotator YAML pointing to the prompt template in `alpaca_eval.txt`.
The annotator configs specify: model name, temperature (0 for deterministic), max tokens (4096), and the prompt template that instructs the judge to compare outputs.
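Those fields map onto alpaca_eval's annotator YAML schema. A minimal sketch of what `alpaca_eval_gpt-4.1.yaml` plausibly contains (key names follow the upstream annotator-config convention; treat them as assumptions and check the file itself):

```yaml
alpaca_eval_gpt-4.1:                     # annotator name, assumed to match the filename
  prompt_template: "alpaca_eval-gpt-4.1/alpaca_eval.txt"  # judge prompt template
  fn_completions: "openai_completions"   # query the OpenAI API for judgments
  completions_kwargs:
    model_name: "gpt-4.1"
    temperature: 0                       # deterministic judging
    max_tokens: 4096
```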
Usage
Run after sampling AlpacaEval outputs. Requires the OPENAI_API_KEY environment variable for the judge model.
Code Reference
Source Location
- Repository: ContextualAI/HALOs
- File: alpaca_eval_gpt-4.1.yaml (annotator config), alpaca_eval_gpt-4.1-mini.yaml, alpaca_eval.txt (prompt template)
- Lines: alpaca_eval_gpt-4.1.yaml:L1-16, alpaca_eval.txt:L1-33
Signature
```shell
alpaca_eval evaluate \
    --model_outputs <outputs.json> \
    --annotators_config <annotator.yaml> \
    --output_path <results_dir>
```
Import
```shell
pip install alpaca-eval  # installed by install.sh
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_outputs | JSON file | Yes | Model outputs with 'instruction' and 'output' fields (from train.sample --mode alpacaeval) |
| annotators_config | YAML file | Yes | Annotator config (alpaca_eval_gpt-4.1.yaml or alpaca_eval_gpt-4.1-mini.yaml) |
| OPENAI_API_KEY | env var | Yes | OpenAI API key for GPT-4 judge |
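A quick sanity check of the `model_outputs` JSON shape before spending judge tokens can be sketched in Python. The required fields come from the table above; the `generator` field is an assumption (a label commonly attached to each output to name the producing model):

```python
# Minimal model_outputs record as described in the I/O contract above.
# The "generator" field is an assumption, not stated in this page.
example = [
    {"instruction": "Name three primary colors.",
     "output": "Red, yellow, and blue.",
     "generator": "llama3-8B-kto"},
]

def check_outputs(records):
    """True if every record carries the required string fields."""
    return all(isinstance(r.get(k), str)
               for r in records
               for k in ("instruction", "output"))

print(check_outputs(example))  # -> True
```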
Outputs
| Name | Type | Description |
|---|---|---|
| Win rate (WR) | float | Percentage of prompts where model output is preferred |
| Length-controlled win rate (LCWR) | float | Length-bias-adjusted win rate |
| Results directory | Files | Annotations, leaderboard metrics, per-example judgments |
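The win rates land in a leaderboard file inside the results directory. A sketch of pulling WR and LC-WR out of it, assuming a CSV layout with `win_rate` and `length_controlled_winrate` columns (filename and column names are assumptions; verify against an actual alpaca_eval run):

```python
import csv
import io

# Stand-in for <results_dir>/leaderboard CSV content; the column
# names below are assumptions to check against a real run.
sample_csv = """\
name,win_rate,length_controlled_winrate
llama3-8B-kto,54.30,51.20
"""

def read_win_rates(text):
    """Map model name -> (WR, LC-WR) from a leaderboard CSV."""
    return {row["name"]: (float(row["win_rate"]),
                          float(row["length_controlled_winrate"]))
            for row in csv.DictReader(io.StringIO(text))}

wr, lcwr = read_win_rates(sample_csv)["llama3-8B-kto"]
print(f"WR={wr:.1f}%  LC-WR={lcwr:.1f}%")  # -> WR=54.3%  LC-WR=51.2%
```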
Usage Examples
Full AlpacaEval Pipeline
```shell
# Step 1: Sample from the aligned model
python -m train.sample /models/llama3-8B-kto/FINAL \
    --datasets alpacaeval \
    --mode alpacaeval \
    --split test \
    --gpu_count 4 \
    --output_file alpacaeval_outputs.json

# Step 2: Run AlpacaEval with the GPT-4.1 judge
alpaca_eval evaluate \
    --model_outputs alpacaeval_outputs.json \
    --annotators_config alpaca_eval_gpt-4.1.yaml \
    --output_path results/llama3-8B-kto/
```
Using GPT-4.1-mini (Cheaper Judge)
```shell
alpaca_eval evaluate \
    --model_outputs alpacaeval_outputs.json \
    --annotators_config alpaca_eval_gpt-4.1-mini.yaml \
    --output_path results/llama3-8B-kto-mini/
```
Related Pages
Implements Principle
Requires Environment