
Implementation:ContextualAI HALOs Alpaca Eval CLI

From Leeroopedia


Knowledge Sources
Domains NLP, Evaluation
Last Updated 2026-02-08 03:00 GMT

Overview

Concrete tool for running AlpacaEval instruction-following benchmarks using the alpaca_eval CLI with custom annotator configs.

Description

The HALOs framework provides custom AlpacaEval annotator configurations for GPT-4.1 and GPT-4.1-mini judges. The evaluation pipeline:

  1. Generate model outputs using train.sample in alpacaeval mode (produces JSON with instruction and output fields)
  2. Run alpaca_eval evaluate with the custom annotator YAML pointing to the prompt template in alpaca_eval.txt

The annotator configs specify the judge model name, the temperature (0, for deterministic judgments), the maximum tokens (4096), and the prompt template that instructs the judge to compare the two outputs.
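Putting those settings together, an annotator config might look roughly like the following. This is a sketch using the standard alpaca_eval annotator schema (prompt_template, fn_completions, completions_kwargs, fn_completion_parser); the exact contents of the HALOs alpaca_eval_gpt-4.1.yaml may differ:

```yaml
# Hypothetical sketch of an annotator config in the standard
# alpaca_eval schema; the actual HALOs file may differ in detail.
alpaca_eval_gpt-4.1:
  prompt_template: "alpaca_eval_gpt-4.1/alpaca_eval.txt"
  fn_completions: "openai_completions"
  completions_kwargs:
    model_name: "gpt-4.1"
    temperature: 0        # deterministic judging
    max_tokens: 4096
  fn_completion_parser: "regex_parser"
```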

Usage

Run after sampling AlpacaEval outputs. Requires the OPENAI_API_KEY environment variable for the judge model.

Code Reference

Source Location

  • Repository: ContextualAI/HALOs
  • File: alpaca_eval_gpt-4.1.yaml (annotator config), alpaca_eval_gpt-4.1-mini.yaml, alpaca_eval.txt (prompt template)
  • Lines: alpaca_eval_gpt-4.1.yaml:L1-16, alpaca_eval.txt:L1-33

Signature

alpaca_eval evaluate \
    --model_outputs <outputs.json> \
    --annotators_config <annotator.yaml> \
    --output_path <results_dir>

Import

pip install alpaca-eval  # Installed by install.sh

I/O Contract

Inputs

Name               Type       Required  Description
model_outputs      JSON file  Yes       Model outputs with 'instruction' and 'output' fields (from train.sample --mode alpacaeval)
annotators_config  YAML file  Yes       Annotator config (alpaca_eval_gpt-4.1.yaml or alpaca_eval_gpt-4.1-mini.yaml)
OPENAI_API_KEY     env var    Yes       OpenAI API key for the GPT-4.1 judge
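Since each judge call costs API credits, it can be worth checking the shape of the model_outputs file before launching the evaluation. A minimal sketch in Python; the two required field names are the ones listed above, and any extra fields are left untouched:

```python
import json

REQUIRED_FIELDS = ("instruction", "output")

def validate_model_outputs(path):
    """Validate that `path` is a JSON list of records, each carrying
    the 'instruction' and 'output' fields alpaca_eval expects.
    Returns the number of records on success."""
    with open(path) as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("model_outputs must be a JSON list of records")
    for i, rec in enumerate(records):
        for field in REQUIRED_FIELDS:
            if field not in rec:
                raise ValueError(f"record {i} is missing '{field}'")
    return len(records)
```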

Outputs

Name                                Type   Description
Win rate (WR)                       float  Percentage of prompts where the model's output is preferred
Length-controlled win rate (LCWR)   float  Win rate adjusted for length bias
Results directory                   files  Annotations, leaderboard metrics, per-example judgments
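The relationship between the per-example judgments and the reported win rate is straightforward to state. A minimal sketch, assuming a hypothetical 0/1 encoding (1 = the evaluated model's output preferred, 0 = the baseline preferred); the actual alpaca_eval annotations use their own preference encoding and also account for ties, and the length-controlled variant additionally models the judge's length bias rather than taking a simple average:

```python
def win_rate(preferences):
    """Win rate as a percentage: the fraction of comparisons in which
    the evaluated model's output was preferred over the baseline's.
    `preferences` uses a hypothetical 0/1 encoding (1 = model preferred)."""
    if not preferences:
        raise ValueError("empty preference list")
    return 100.0 * sum(preferences) / len(preferences)
```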

Usage Examples

Full AlpacaEval Pipeline

# Step 1: Sample from the aligned model
python -m train.sample /models/llama3-8B-kto/FINAL \
    --datasets alpacaeval \
    --mode alpacaeval \
    --split test \
    --gpu_count 4 \
    --output_file alpacaeval_outputs.json

# Step 2: Run AlpacaEval with GPT-4.1 judge
alpaca_eval evaluate \
    --model_outputs alpacaeval_outputs.json \
    --annotators_config alpaca_eval_gpt-4.1.yaml \
    --output_path results/llama3-8B-kto/

Using GPT-4.1-mini (Cheaper Judge)

alpaca_eval evaluate \
    --model_outputs alpacaeval_outputs.json \
    --annotators_config alpaca_eval_gpt-4.1-mini.yaml \
    --output_path results/llama3-8B-kto-mini/

Related Pages

Implements Principle

Requires Environment
