Principle:Huggingface Alignment handbook LLM Evaluation Benchmarks
| Knowledge Sources | |
|---|---|
| Domains | NLP, Evaluation |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
An evaluation methodology that assesses aligned language model quality using LLM-as-a-judge benchmarks and human preference proxies.
Description
LLM Evaluation Benchmarks measure how well an aligned model follows instructions, produces helpful responses, and avoids harmful outputs. The alignment-handbook recommends two primary benchmarks:
- MT-Bench: A multi-turn benchmark with 80 questions across 8 categories (writing, roleplay, extraction, reasoning, math, coding, knowledge, STEM). Uses GPT-4 as a judge to score model responses on a 1-10 scale. Evaluates both single-turn and multi-turn conversation ability.
- AlpacaEval: A single-turn benchmark with 805 instructions. Uses GPT-4 as a judge to compute a win rate: the percentage of times the model's response is preferred over a reference model (text-davinci-003). AlpacaEval 2.0 uses length-controlled win rates to mitigate verbosity bias.
These benchmarks serve as automated proxies for human evaluation, enabling rapid iteration on alignment recipes without expensive human annotation.
Usage
Use LLM evaluation benchmarks after completing the full alignment pipeline (SFT → DPO or ORPO) to assess model quality. These benchmarks require a running inference server (typically vLLM) and API access to a judge model (GPT-4).
Theoretical Basis
LLM-as-a-judge evaluation follows a two-stage process:
# Abstract evaluation flow (NOT real implementation)
# Stage 1: Generate model responses
for question in benchmark_questions:
response = model.generate(question)
# Stage 2: Judge responses
for question, response in zip(questions, responses):
score = judge_model.evaluate(question, response)
# MT-Bench: score 1-10 per category
# AlpacaEval: win/loss vs reference model
Key evaluation metrics:
- MT-Bench score: Average score across 8 categories (1-10 scale). Top models score 8+
- AlpacaEval win rate: Percentage of instructions where the model beats the reference (higher is better)
- AlpacaEval LC win rate: Length-controlled win rate that penalizes verbose responses