Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Huggingface Alignment handbook MT Bench AlpacaEval

From Leeroopedia


Knowledge Sources
Domains NLP, Evaluation
Last Updated 2026-02-07 00:00 GMT

Overview

Concrete tools for evaluating aligned models using MT-Bench and AlpacaEval external benchmarks, as documented in the alignment-handbook evaluation guide.

Description

The alignment-handbook does not include evaluation code in-repo. Instead, it documents external evaluation tools in scripts/README.md:

  • MT-Bench (from the FastChat library): Runs 80 multi-turn questions through the model and uses GPT-4 to judge response quality on a 1-10 scale across 8 categories.
  • AlpacaEval (from tatsu-lab): Runs 805 single-turn instructions and computes win rate against a reference model using GPT-4 as judge.

Both tools require a vLLM inference server running the trained model.

Usage

Use after training is complete. Launch a vLLM inference server with the trained model, then run MT-Bench and/or AlpacaEval against it.

Code Reference

Source Location

  • Repository: alignment-handbook
  • File: scripts/README.md (documentation only, no in-repo implementation)

Signature

# MT-Bench evaluation (external tool)
python -m fastchat.llm_judge.gen_model_answer \
    --model-path <model_path> \
    --model-id <model_id>

python -m fastchat.llm_judge.gen_judgment \
    --model-list <model_id>

python -m fastchat.llm_judge.show_result

# AlpacaEval evaluation (external tool)
alpaca_eval --model_outputs <output_file> \
    --annotators_config weighted_alpaca_eval_gpt4_turbo

Import

# External tools installed via pip
pip install "fschat[model_worker,webui]"
pip install alpaca-eval

I/O Contract

Inputs

Name Type Required Description
model_path str Yes Path to trained model (local or HuggingFace Hub ID)
num_gpus int No Number of GPUs for vLLM inference server

Outputs

Name Type Description
MT-Bench scores Dict Scores per category (1-10 scale) and overall average
AlpacaEval win rate float Percentage of wins against reference model (0-100%)
AlpacaEval LC win rate float Length-controlled win rate

Usage Examples

MT-Bench Evaluation Pipeline

# 1. Start vLLM inference server
python -m vllm.entrypoints.openai.api_server \
    --model alignment-handbook/zephyr-7b-sft-full \
    --tensor-parallel-size 1

# 2. Generate model answers
python -m fastchat.llm_judge.gen_model_answer \
    --model-path alignment-handbook/zephyr-7b-sft-full \
    --model-id zephyr-7b-sft

# 3. Run GPT-4 judgment
python -m fastchat.llm_judge.gen_judgment \
    --model-list zephyr-7b-sft

# 4. View results
python -m fastchat.llm_judge.show_result

AlpacaEval 2.0

# Generate model outputs and evaluate
alpaca_eval --model_outputs outputs/zephyr-7b-dpo.json \
    --annotators_config weighted_alpaca_eval_gpt4_turbo

Related Pages

Implements Principle

Requires Environment

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment