Implementation:Huggingface Alignment handbook MT Bench AlpacaEval
Appearance
| Knowledge Sources | |
|---|---|
| Domains | NLP, Evaluation |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tools for evaluating aligned models using MT-Bench and AlpacaEval external benchmarks, as documented in the alignment-handbook evaluation guide.
Description
The alignment-handbook does not include evaluation code in-repo. Instead, it documents external evaluation tools in scripts/README.md:
- MT-Bench (from the FastChat library): Runs 80 multi-turn questions through the model and uses GPT-4 to judge response quality on a 1-10 scale across 8 categories.
- AlpacaEval (from tatsu-lab): Runs 805 single-turn instructions and computes win rate against a reference model using GPT-4 as judge.
Both tools require a vLLM inference server running the trained model.
Usage
Use after training is complete. Launch a vLLM inference server with the trained model, then run MT-Bench and/or AlpacaEval against it.
Code Reference
Source Location
- Repository: alignment-handbook
- File: scripts/README.md (documentation only, no in-repo implementation)
Signature
# MT-Bench evaluation (external tool)
python -m fastchat.llm_judge.gen_model_answer \
--model-path <model_path> \
--model-id <model_id>
python -m fastchat.llm_judge.gen_judgment \
--model-list <model_id>
python -m fastchat.llm_judge.show_result
# AlpacaEval evaluation (external tool)
alpaca_eval --model_outputs <output_file> \
--annotators_config weighted_alpaca_eval_gpt4_turbo
Import
# External tools installed via pip
pip install "fschat[model_worker,webui]"
pip install alpaca-eval
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_path | str | Yes | Path to trained model (local or HuggingFace Hub ID) |
| num_gpus | int | No | Number of GPUs for vLLM inference server |
Outputs
| Name | Type | Description |
|---|---|---|
| MT-Bench scores | Dict | Scores per category (1-10 scale) and overall average |
| AlpacaEval win rate | float | Percentage of wins against reference model (0-100%) |
| AlpacaEval LC win rate | float | Length-controlled win rate |
Usage Examples
MT-Bench Evaluation Pipeline
# 1. Start vLLM inference server
python -m vllm.entrypoints.openai.api_server \
--model alignment-handbook/zephyr-7b-sft-full \
--tensor-parallel-size 1
# 2. Generate model answers
python -m fastchat.llm_judge.gen_model_answer \
--model-path alignment-handbook/zephyr-7b-sft-full \
--model-id zephyr-7b-sft
# 3. Run GPT-4 judgment
python -m fastchat.llm_judge.gen_judgment \
--model-list zephyr-7b-sft
# 4. View results
python -m fastchat.llm_judge.show_result
AlpacaEval 2.0
# Generate model outputs and evaluate
alpaca_eval --model_outputs outputs/zephyr-7b-dpo.json \
--annotators_config weighted_alpaca_eval_gpt4_turbo
Related Pages
Implements Principle
Requires Environment
Page Connections
Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment