Implementation: ContextualAI HALOs AlpacaEval CLI
| Knowledge Sources | |
|---|---|
| Domains | NLP, Evaluation |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
Concrete tool for running AlpacaEval instruction-following benchmarks using the alpaca_eval CLI with custom annotator configs.
Description
The HALOs framework provides custom AlpacaEval annotator configurations for GPT-4.1 and GPT-4.1-mini judges. The evaluation pipeline:
1. Generate model outputs using `train.sample` in alpacaeval mode (produces JSON with `instruction` and `output` fields).
2. Run `alpaca_eval evaluate` with the custom annotator YAML pointing to the prompt template in `alpaca_eval.txt`.
The annotator configs specify: model name, temperature (0 for deterministic), max tokens (4096), and the prompt template that instructs the judge to compare outputs.
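Those fields map onto alpaca_eval's annotator YAML schema. A minimal sketch of what `alpaca_eval_gpt-4.1.yaml` plausibly contains (key names follow the upstream annotator-config convention; treat them as assumptions and check the file itself):

```yaml
alpaca_eval_gpt-4.1:                     # annotator name, assumed to match the filename
  prompt_template: "alpaca_eval-gpt-4.1/alpaca_eval.txt"  # judge prompt template
  fn_completions: "openai_completions"   # query the OpenAI API for judgments
  completions_kwargs:
    model_name: "gpt-4.1"
    temperature: 0                       # deterministic judging
    max_tokens: 4096
```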
Usage
Run after sampling AlpacaEval outputs. Requires the OPENAI_API_KEY environment variable for the judge model.
Code Reference
Source Location
- Repository: ContextualAI/HALOs
- File: alpaca_eval_gpt-4.1.yaml (annotator config), alpaca_eval_gpt-4.1-mini.yaml, alpaca_eval.txt (prompt template)
- Lines: alpaca_eval_gpt-4.1.yaml:L1-16, alpaca_eval.txt:L1-33
Signature
```shell
alpaca_eval evaluate \
    --model_outputs <outputs.json> \
    --annotators_config <annotator.yaml> \
    --output_path <results_dir>
```
Import
```shell
pip install alpaca-eval  # installed by install.sh
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_outputs | JSON file | Yes | Model outputs with 'instruction' and 'output' fields (from train.sample --mode alpacaeval) |
| annotators_config | YAML file | Yes | Annotator config (alpaca_eval_gpt-4.1.yaml or alpaca_eval_gpt-4.1-mini.yaml) |
| OPENAI_API_KEY | env var | Yes | OpenAI API key for GPT-4 judge |
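A quick sanity check of the `model_outputs` JSON shape before spending judge tokens can be sketched in Python. The required fields come from the table above; the `generator` field is an assumption (a label commonly attached to each output to name the producing model):

```python
# Minimal model_outputs record as described in the I/O contract above.
# The "generator" field is an assumption, not stated in this page.
example = [
    {"instruction": "Name three primary colors.",
     "output": "Red, yellow, and blue.",
     "generator": "llama3-8B-kto"},
]

def check_outputs(records):
    """True if every record carries the required string fields."""
    return all(isinstance(r.get(k), str)
               for r in records
               for k in ("instruction", "output"))

print(check_outputs(example))  # -> True
```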
Outputs
| Name | Type | Description |
|---|---|---|
| Win rate (WR) | float | Percentage of prompts where model output is preferred |
| Length-controlled win rate (LCWR) | float | Length-bias-adjusted win rate |
| Results directory | Files | Annotations, leaderboard metrics, per-example judgments |
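The win rates land in a leaderboard file inside the results directory. A sketch of pulling WR and LC-WR out of it, assuming a CSV layout with `win_rate` and `length_controlled_winrate` columns (filename and column names are assumptions; verify against an actual alpaca_eval run):

```python
import csv
import io

# Stand-in for <results_dir>/leaderboard CSV content; the column
# names below are assumptions to check against a real run.
sample_csv = """\
name,win_rate,length_controlled_winrate
llama3-8B-kto,54.30,51.20
"""

def read_win_rates(text):
    """Map model name -> (WR, LC-WR) from a leaderboard CSV."""
    return {row["name"]: (float(row["win_rate"]),
                          float(row["length_controlled_winrate"]))
            for row in csv.DictReader(io.StringIO(text))}

wr, lcwr = read_win_rates(sample_csv)["llama3-8B-kto"]
print(f"WR={wr:.1f}%  LC-WR={lcwr:.1f}%")  # -> WR=54.3%  LC-WR=51.2%
```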
Usage Examples
Full AlpacaEval Pipeline
```shell
# Step 1: Sample from the aligned model
python -m train.sample /models/llama3-8B-kto/FINAL \
    --datasets alpacaeval \
    --mode alpacaeval \
    --split test \
    --gpu_count 4 \
    --output_file alpacaeval_outputs.json

# Step 2: Run AlpacaEval with the GPT-4.1 judge
alpaca_eval evaluate \
    --model_outputs alpacaeval_outputs.json \
    --annotators_config alpaca_eval_gpt-4.1.yaml \
    --output_path results/llama3-8B-kto/
```
Using GPT-4.1-mini (Cheaper Judge)
```shell
alpaca_eval evaluate \
    --model_outputs alpacaeval_outputs.json \
    --annotators_config alpaca_eval_gpt-4.1-mini.yaml \
    --output_path results/llama3-8B-kto-mini/
```
Related Pages
Implements Principle
Requires Environment