Implementation:Lm sys FastChat Rating Systems
| Knowledge Sources | |
|---|---|
| Domains | Model_Evaluation, Statistics |
| Last Updated | 2026-02-07 06:00 GMT |
Overview
rating_systems.py implements several statistical rating systems, including Elo, Bradley-Terry, and bootstrapped variants, for computing model strength rankings from pairwise arena battle outcomes.
Description
The rating_systems.py module provides a collection of well-established statistical methods for deriving cardinal ratings from pairwise comparison data. These rating systems form the mathematical backbone of the Chatbot Arena leaderboard, translating raw "model A vs. model B" battle outcomes into meaningful numerical rankings that quantify relative model strength.
The classic Elo rating system, implemented in compute_elo, processes battles sequentially and updates ratings using a logistic expected-score formula. It is parameterized by the K-factor (controlling update magnitude), scale, base, and initial rating. While simple and interpretable, sequential Elo is sensitive to battle ordering. The Bradley-Terry model, implemented in compute_bt, addresses this by solving a maximum likelihood estimation problem over the full battle dataset simultaneously, producing ratings that are invariant to battle ordering and statistically principled.
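The sequential update and its sensitivity to ordering can be illustrated with a minimal pure-Python sketch. This is not the module's code: the function name `sequential_elo`, the tuple-based battle format, and the sample battles are assumptions made for illustration, and only the parameter meanings (K-factor, scale, base, initial rating) follow the description above.

```python
from collections import defaultdict


def sequential_elo(battles, k=4.0, scale=400.0, base=10.0, init_rating=1000.0):
    """Illustrative sequential Elo (hypothetical helper, not the FastChat code).

    battles: iterable of (model_a, model_b, winner) tuples, where winner is
    "model_a", "model_b", or "tie".
    """
    ratings = defaultdict(lambda: init_rating)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Logistic expected score of model_a against model_b.
        expected_a = 1.0 / (1.0 + base ** ((rb - ra) / scale))
        # Actual score: 1 for a win, 0 for a loss, 0.5 for a tie.
        score_a = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
        # The update is zero-sum: what model_a gains, model_b loses.
        delta = k * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


# Made-up battles, used only to show the ordering effect.
battles = [
    ("alpha", "beta", "model_a"),
    ("beta", "gamma", "model_a"),
    ("alpha", "gamma", "tie"),
    ("gamma", "alpha", "model_a"),
]
forward = sequential_elo(battles)
backward = sequential_elo(list(reversed(battles)))
print(forward)
print(backward)  # same battles, different order, different ratings
```

Because each update depends on the ratings at that moment, replaying the same battles in reverse yields different final ratings. A maximum-likelihood fit over the full dataset avoids this ordering dependence.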
To quantify uncertainty in the computed ratings, compute_bootstrap_elo performs bootstrap resampling: it repeatedly draws random samples (with replacement) from the battle dataset and computes Elo ratings on each sample, producing a distribution of ratings for every model. Confidence intervals taken from that distribution show whether an apparent rating difference between two models is larger than the resampling noise. Additional rating computation methods provide variants and extensions tailored to specific analysis needs, such as handling ties, controlling for prompt difficulty, or computing ratings within specific time windows.
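The resampling idea can be sketched without pandas as follows. Everything here is an assumption for illustration: the helper names `elo`, `bootstrap_elo`, and `percentile_interval`, the tuple-based battle format, the fixed seed, and the synthetic data. The real compute_bootstrap_elo returns a DataFrame, as described in this page.

```python
import random
import statistics


def elo(battles, k=4.0, scale=400.0, base=10.0, init=1000.0):
    """Compact sequential Elo over (model_a, model_b, winner) tuples."""
    ratings = {}
    for a, b, winner in battles:
        ra, rb = ratings.get(a, init), ratings.get(b, init)
        expected_a = 1.0 / (1.0 + base ** ((rb - ra) / scale))
        score_a = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
        delta = k * (score_a - expected_a)
        ratings[a], ratings[b] = ra + delta, rb - delta
    return ratings


def bootstrap_elo(battles, num_round=200, seed=0):
    """Recompute Elo on num_round resamples drawn with replacement."""
    rng = random.Random(seed)
    rounds = []
    for _ in range(num_round):
        # Same size as the original dataset, sampled with replacement.
        sample = rng.choices(battles, k=len(battles))
        rounds.append(elo(sample))
    return rounds


def percentile_interval(rounds, model, lower=0.025, upper=0.975):
    """Return (lower bound, median, upper bound) of a model's bootstrap ratings."""
    values = sorted(r[model] for r in rounds if model in r)
    lo_idx = int(lower * (len(values) - 1))
    hi_idx = int(upper * (len(values) - 1))
    return values[lo_idx], statistics.median(values), values[hi_idx]


# Synthetic data: "strong" wins 80 of 100 battles against "weak".
battles = [("strong", "weak", "model_a")] * 80 + [("strong", "weak", "model_b")] * 20
rounds = bootstrap_elo(battles, num_round=200, seed=0)
low, mid, high = percentile_interval(rounds, "strong")
print(f"strong: {mid:.1f} [{low:.1f}, {high:.1f}]")
```

Each bootstrap round both resamples and reorders the battles, so the spread of the resulting ratings reflects sampling noise together with the ordering sensitivity of sequential Elo.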
Usage
Use this module when you need to compute model ratings from pairwise battle data. Choose compute_elo for quick, sequential computation; compute_bt for statistically rigorous maximum-likelihood ratings; and compute_bootstrap_elo when you need confidence intervals. These functions are typically called by elo_analysis.py rather than invoked directly, but they can be used independently for custom analyses.
Code Reference
Source Location
- Repository: Lm_sys_FastChat
- File: fastchat/serve/monitor/rating_systems.py
- Lines: 1-385
Signature
def compute_elo(
    battles: pd.DataFrame,
    K: float = 4.0,
    SCALE: float = 400.0,
    BASE: float = 10.0,
    INIT_RATING: float = 1000.0,
) -> dict[str, float]:
    """Compute sequential Elo ratings from battle outcomes.

    Args:
        battles: DataFrame with columns model_a, model_b, winner.
        K: K-factor controlling rating update magnitude.
        SCALE: Logistic scale parameter.
        BASE: Logistic base parameter.
        INIT_RATING: Starting rating for all models.

    Returns:
        Dictionary mapping model names to their Elo ratings.
    """
    ...

def compute_bt(
    battles: pd.DataFrame,
) -> dict[str, float]:
    """Compute Bradley-Terry maximum-likelihood ratings.

    Args:
        battles: DataFrame with columns model_a, model_b, winner.

    Returns:
        Dictionary mapping model names to their Bradley-Terry ratings.
    """
    ...

def compute_bootstrap_elo(
    battles: pd.DataFrame,
    num_round: int = 1000,
    K: float = 4.0,
) -> pd.DataFrame:
    """Compute bootstrapped Elo ratings for confidence intervals.

    Args:
        battles: DataFrame with columns model_a, model_b, winner.
        num_round: Number of bootstrap resampling rounds.
        K: K-factor for each Elo computation.

    Returns:
        DataFrame with shape (num_round, num_models) of per-round ratings.
    """
    ...
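To give a sense of what a Bradley-Terry maximum-likelihood fit involves, here is a minimal sketch using the classic minorization-maximization (MM) iteration. It is an illustration under stated assumptions, not the body of compute_bt: the function name `bradley_terry`, the tuple-based input, counting a tie as half a win for each side, and the conversion to an Elo-like scale are all choices made for this example. It also assumes that every model has at least one (possibly fractional) win and that no model battles itself.

```python
import math
from collections import defaultdict


def bradley_terry(battles, scale=400.0, base=10.0, init_rating=1000.0,
                  iters=1000, tol=1e-10):
    """Illustrative Bradley-Terry fit via MM updates (hypothetical helper)."""
    wins = defaultdict(float)   # total wins per model (a tie counts 0.5)
    games = defaultdict(float)  # number of battles per unordered pair
    for a, b, winner in battles:
        score_a = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
        wins[a] += score_a
        wins[b] += 1.0 - score_a
        games[frozenset((a, b))] += 1.0

    models = sorted(wins)
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for m in models:
            # MM update: wins divided by sum of n_ij / (p_i + p_j).
            denom = 0.0
            for pair, n in games.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            new[m] = wins[m] / denom
        # Strengths are only defined up to a constant factor, so fix the
        # geometric mean at 1.
        log_mean = sum(math.log(v) for v in new.values()) / len(new)
        new = {m: v / math.exp(log_mean) for m, v in new.items()}
        converged = max(abs(new[m] - strength[m]) for m in models) < tol
        strength = new
        if converged:
            break
    # Map strengths onto an Elo-like scale.
    return {m: init_rating + scale * math.log(strength[m], base) for m in models}


# Made-up data: A beats B in 3 of 4 battles.
battles = [("A", "B", "model_a")] * 3 + [("A", "B", "model_b")]
ratings = bradley_terry(battles)
print(ratings)
```

The fit depends only on aggregate win and game counts, which is why the result does not change when the battles are reordered.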
Import
from fastchat.serve.monitor.rating_systems import compute_elo, compute_bt, compute_bootstrap_elo
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| battles | pd.DataFrame | Yes | Battle records with columns: model_a, model_b, winner (values: "model_a", "model_b", or "tie") |
| K | float | No | Elo K-factor controlling update sensitivity (default: 4.0) |
| SCALE | float | No | Logistic scale parameter for expected score computation (default: 400.0) |
| BASE | float | No | Logistic base parameter (default: 10.0) |
| INIT_RATING | float | No | Initial rating assigned to all models (default: 1000.0) |
| num_round | int | No | Number of bootstrap resampling rounds (default: 1000) |
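Before passing data to the rating functions, it can help to check it against the input contract above. The following is a minimal sketch; the helper name `validate_battles` and its record-based input (for example, the output of `battles.to_dict("records")` on a pandas DataFrame) are assumptions for illustration and are not part of rating_systems.py.

```python
REQUIRED_FIELDS = ("model_a", "model_b", "winner")
ALLOWED_WINNERS = {"model_a", "model_b", "tie"}


def validate_battles(records):
    """Return a list of (index, problem) pairs; an empty list means no issues found.

    records: iterable of dict-like battle records (hypothetical input format).
    """
    problems = []
    for i, rec in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if f not in rec]
        if missing:
            problems.append((i, f"missing fields: {missing}"))
            continue
        if rec["winner"] not in ALLOWED_WINNERS:
            problems.append((i, f"unexpected winner value: {rec['winner']!r}"))
    return problems


# Made-up records: the second lacks model_b, the third has an invalid winner.
records = [
    {"model_a": "x", "model_b": "y", "winner": "tie"},
    {"model_a": "x", "winner": "model_a"},
    {"model_a": "x", "model_b": "y", "winner": "both"},
]
for index, problem in validate_battles(records):
    print(index, problem)
```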
Outputs
| Name | Type | Description |
|---|---|---|
| elo_ratings | dict[str, float] | Dictionary mapping each model name to its computed Elo rating |
| bt_ratings | dict[str, float] | Dictionary mapping each model name to its Bradley-Terry rating |
| bootstrap_df | pd.DataFrame | DataFrame of shape (num_round, num_models) containing per-round Elo ratings for computing confidence intervals |
Usage Examples
import pandas as pd

from fastchat.serve.monitor.clean_battle_data import clean_battle_data
from fastchat.serve.monitor.rating_systems import (
    compute_elo,
    compute_bt,
    compute_bootstrap_elo,
)

# Load cleaned battle data
battles = clean_battle_data(["logs/battles.json"])

# Compute sequential Elo ratings and print the ten highest-rated models
elo_ratings = compute_elo(battles, K=4.0, INIT_RATING=1000.0)
for model, rating in sorted(elo_ratings.items(), key=lambda x: -x[1])[:10]:
    print(f" {model}: {rating:.1f}")

# Compute Bradley-Terry ratings and print the ten highest-rated models
bt_ratings = compute_bt(battles)
for model, rating in sorted(bt_ratings.items(), key=lambda x: -x[1])[:10]:
    print(f" {model}: {rating:.1f}")

# Compute bootstrap confidence intervals (95% percentile interval per model)
bootstrap_df = compute_bootstrap_elo(battles, num_round=1000, K=4.0)
ci_lower = bootstrap_df.quantile(0.025)
ci_upper = bootstrap_df.quantile(0.975)
median = bootstrap_df.median()
for model in median.sort_values(ascending=False).index[:10]:
    print(f" {model}: {median[model]:.1f} [{ci_lower[model]:.1f}, {ci_upper[model]:.1f}]")
Related Pages
- Implements: Principle:Lm_sys_FastChat_Pairwise_Rating_Computation
- Environment:Lm_sys_FastChat_GPU_CUDA_Inference
- Lm_sys_FastChat_Elo_Analysis - Higher-level analysis module that orchestrates rating computation
- Lm_sys_FastChat_Clean_Battle_Data - Produces the cleaned battle data consumed by rating functions
- Lm_sys_FastChat_Monitor_Dashboard - Displays computed ratings in the leaderboard UI