Implementation:Huggingface Alignment handbook MT Bench AlpacaEval

Knowledge Sources	Alignment Handbook FastChat AlpacaEval
Domains	NLP, Evaluation
Last Updated	2026-02-07 00:00 GMT

Overview

Concrete tools for evaluating aligned models using MT-Bench and AlpacaEval external benchmarks, as documented in the alignment-handbook evaluation guide.

Description

The alignment-handbook does not include evaluation code in-repo. Instead, it documents external evaluation tools in scripts/README.md:

MT-Bench (from the FastChat library): Runs 80 multi-turn questions through the model and uses GPT-4 to judge response quality on a 1-10 scale across 8 categories.
AlpacaEval (from tatsu-lab): Runs 805 single-turn instructions and computes win rate against a reference model using GPT-4 as judge.

Both tools require a vLLM inference server running the trained model.

Usage

Use after training is complete. Launch a vLLM inference server with the trained model, then run MT-Bench and/or AlpacaEval against it.

Code Reference

Source Location

Repository: alignment-handbook
File: scripts/README.md (documentation only, no in-repo implementation)

Signature

# MT-Bench evaluation (external tool)
python -m fastchat.llm_judge.gen_model_answer \
    --model-path <model_path> \
    --model-id <model_id>

python -m fastchat.llm_judge.gen_judgment \
    --model-list <model_id>

python -m fastchat.llm_judge.show_result

# AlpacaEval evaluation (external tool)
alpaca_eval --model_outputs <output_file> \
    --annotators_config weighted_alpaca_eval_gpt4_turbo

Import

# External tools installed via pip
pip install "fschat[model_worker,webui]"
pip install alpaca-eval

I/O Contract

Inputs

Name	Type	Required	Description
model_path	str	Yes	Path to trained model (local or HuggingFace Hub ID)
num_gpus	int	No	Number of GPUs for vLLM inference server

Outputs

Name	Type	Description
MT-Bench scores	Dict	Scores per category (1-10 scale) and overall average
AlpacaEval win rate	float	Percentage of wins against reference model (0-100%)
AlpacaEval LC win rate	float	Length-controlled win rate

Usage Examples

MT-Bench Evaluation Pipeline

# 1. Start vLLM inference server
python -m vllm.entrypoints.openai.api_server \
    --model alignment-handbook/zephyr-7b-sft-full \
    --tensor-parallel-size 1

# 2. Generate model answers
python -m fastchat.llm_judge.gen_model_answer \
    --model-path alignment-handbook/zephyr-7b-sft-full \
    --model-id zephyr-7b-sft

# 3. Run GPT-4 judgment
python -m fastchat.llm_judge.gen_judgment \
    --model-list zephyr-7b-sft

# 4. View results
python -m fastchat.llm_judge.show_result

AlpacaEval 2.0

# Generate model outputs and evaluate
alpaca_eval --model_outputs outputs/zephyr-7b-dpo.json \
    --annotators_config weighted_alpaca_eval_gpt4_turbo

Related Pages

Implements Principle

Principle:Huggingface_Alignment_handbook_LLM_Evaluation_Benchmarks

Requires Environment

Environment:Huggingface_Alignment_handbook_Evaluation_Tools

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment