Implementation:Lm sys FastChat Elo Analysis
| Knowledge Sources | |
|---|---|
| Domains | Model_Evaluation, Statistics |
| Last Updated | 2026-02-07 06:00 GMT |
Overview
elo_analysis computes Elo ratings with bootstrap confidence intervals from arena battle data and generates win-rate heatmaps and ranking bar charts.
Description
The elo_analysis.py module is the central analytics engine of the Chatbot Arena leaderboard. It takes cleaned battle data and produces Elo ratings for every participating model, along with statistical confidence measures and publication-quality visualizations. This module bridges the gap between raw pairwise comparison data and the leaderboard rankings displayed to users.
The core function compute_elo_results runs the Elo rating algorithm over a battles DataFrame, applying configurable parameters for the K-factor, scale, base, and initial rating. It returns a dictionary containing final ratings, per-model statistics, and intermediate computation artifacts. To quantify uncertainty, the module implements bootstrap resampling via repeated Elo computations on randomly sampled subsets of the battle data, producing confidence intervals for each model's rating.
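The rating update and the bootstrap procedure described above can be sketched as follows. This is a minimal illustration, not the module's actual code: `online_elo` and `bootstrap_elo_ci` are hypothetical names, the winner labels `"model_a"` and `"model_b"` (with any other value treated as a tie) are assumptions, and the resampling here draws rows with replacement at the original sample size, which may differ from how the module samples.

```python
from collections import defaultdict

import numpy as np
import pandas as pd


def online_elo(battles, K=4.0, SCALE=400.0, BASE=10.0, INIT_RATING=1000.0):
    """Sequential Elo update over rows with model_a, model_b, winner columns."""
    rating = defaultdict(lambda: INIT_RATING)
    cols = battles[["model_a", "model_b", "winner"]]
    for model_a, model_b, winner in cols.itertuples(index=False):
        ra, rb = rating[model_a], rating[model_b]
        # Expected score of model_a under the logistic Elo model.
        ea = 1.0 / (1.0 + BASE ** ((rb - ra) / SCALE))
        # Actual score of model_a; any non-decisive label counts as a tie (assumption).
        if winner == "model_a":
            sa = 1.0
        elif winner == "model_b":
            sa = 0.0
        else:
            sa = 0.5
        delta = K * (sa - ea)
        rating[model_a] = ra + delta
        rating[model_b] = rb - delta
    return dict(rating)


def bootstrap_elo_ci(battles, num_rounds=100, seed=0, **elo_kwargs):
    """Percentile intervals from Elo runs on resampled battle rows."""
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(num_rounds):
        # Draw row positions with replacement, keeping the sample size.
        idx = rng.integers(len(battles), size=len(battles))
        runs.append(online_elo(battles.iloc[idx], **elo_kwargs))
    ratings = pd.DataFrame(runs)  # one row per round, one column per model
    return pd.DataFrame(
        {
            "lower": ratings.quantile(0.025),
            "median": ratings.quantile(0.5),
            "upper": ratings.quantile(0.975),
        }
    )
```

Because the update is sequential, the result depends on row order; resampling shuffles that order as a side effect, which is one reason bootstrap intervals are useful for a small K-factor.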
The report_elo_analysis_results function orchestrates the full analysis workflow: it computes ratings, generates bootstrap confidence intervals, builds win-rate matrices between all model pairs, and produces visualizations. Output includes matplotlib/seaborn heatmaps showing head-to-head win rates and horizontal bar charts showing model rankings with confidence intervals. The get_arena_table function merges Elo ratings with model metadata to produce the final leaderboard table, and pretty_print_elo_rating formats ratings for display.
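As a rough illustration of the win-rate matrix step, the sketch below counts decisive battles per ordered model pair and divides by the number of decisive meetings. `pairwise_win_rate` is a hypothetical helper, and excluding ties from the denominator is an assumption; the module may treat ties differently.

```python
import pandas as pd


def pairwise_win_rate(battles):
    """Share of decisive battles the row model won against the column model."""
    decisive = battles[battles["winner"].isin(["model_a", "model_b"])]
    a_won = decisive["winner"] == "model_a"
    # Pick the winning and losing model name for every decisive battle.
    winners = decisive["model_a"].where(a_won, decisive["model_b"])
    losers = decisive["model_b"].where(a_won, decisive["model_a"])
    wins = pd.crosstab(winners, losers).rename_axis(index=None, columns=None)
    # Make the matrix square so it can be added to its transpose.
    models = sorted(set(decisive["model_a"]) | set(decisive["model_b"]))
    wins = wins.reindex(index=models, columns=models, fill_value=0)
    total = wins + wins.T
    # Pairs that never met (and the diagonal) come out as NaN.
    return wins / total
```

The resulting DataFrame can be passed directly to a heatmap function such as `seaborn.heatmap`, which leaves NaN cells blank.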
Usage
Use this module after cleaning battle data with clean_battle_data. It is invoked both by the monitor dashboard for real-time leaderboard updates and by offline analysis scripts for generating research reports and blog post figures. The output can be serialized to JSON for caching and served by the Gradio dashboard.
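One way to cache results as JSON is sketched below. The results dictionary here is a made-up stand-in, and `to_jsonable` is a hypothetical fallback encoder; it assumes the real results may contain pandas or NumPy objects that the standard `json` module cannot serialize on its own.

```python
import json

import numpy as np
import pandas as pd


def to_jsonable(obj):
    """Fallback encoder for values json cannot serialize natively."""
    if isinstance(obj, pd.DataFrame):
        return obj.to_dict(orient="index")
    if isinstance(obj, pd.Series):
        return obj.to_dict()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):
        return obj.item()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


# Illustrative stand-in for an analysis result (values are invented).
results = {
    "elo_rating": {"model-x": 1012.5, "model-y": 987.5},
    "num_battles": pd.Series({"model-x": 40, "model-y": 40}),
}

payload = json.dumps(results, default=to_jsonable, indent=2)
restored = json.loads(payload)
```

Matplotlib figures are not covered by this encoder; they would need to be saved separately, for example as image files.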
Code Reference
Source Location
- Repository: Lm_sys_FastChat
- File: fastchat/serve/monitor/elo_analysis.py
- Lines: 1-549
Signature
def compute_elo_results(
    battles: pd.DataFrame,
    K: float = 4.0,
    SCALE: float = 400.0,
    BASE: float = 10.0,
    INIT_RATING: float = 1000.0,
) -> dict:
    """Compute Elo ratings from a battles DataFrame."""
    ...

def report_elo_analysis_results(
    battles_df: pd.DataFrame,
) -> dict:
    """Run full Elo analysis with bootstrap CIs and visualizations."""
    ...

def get_arena_table(
    arena_df: pd.DataFrame,
    model_table_df: pd.DataFrame,
) -> pd.DataFrame:
    """Merge Elo ratings with model metadata for leaderboard display."""
    ...

def pretty_print_elo_rating(rating: dict) -> str:
    """Format Elo ratings for human-readable display."""
    ...
Import
from fastchat.serve.monitor.elo_analysis import report_elo_analysis_results
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| battles | pd.DataFrame | Yes | Cleaned battle records with columns: model_a, model_b, winner |
| K | float | No | Elo K-factor controlling rating sensitivity (default: 4.0) |
| SCALE | float | No | Elo scale parameter (default: 400.0) |
| BASE | float | No | Elo base parameter (default: 10.0) |
| INIT_RATING | float | No | Initial Elo rating for new models (default: 1000.0) |
| arena_df | pd.DataFrame | Yes (for get_arena_table) | DataFrame of Elo ratings and statistics per model |
| model_table_df | pd.DataFrame | Yes (for get_arena_table) | Model metadata table with organization, license, and link info |
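For reference, a minimal battles DataFrame with the three required columns might look like the sketch below. The model names are invented, the winner labels (`"model_a"`, `"model_b"`, `"tie"`) are assumed rather than taken from this page, and `check_battle_columns` is a hypothetical helper, not part of the module.

```python
import pandas as pd

REQUIRED_COLUMNS = ("model_a", "model_b", "winner")


def check_battle_columns(battles: pd.DataFrame) -> None:
    """Raise ValueError if a required column is missing (hypothetical helper)."""
    missing = [c for c in REQUIRED_COLUMNS if c not in battles.columns]
    if missing:
        raise ValueError(f"battles is missing columns: {missing}")


# Three illustrative battle records; the winner labels are an assumption.
battles = pd.DataFrame(
    [
        {"model_a": "model-x", "model_b": "model-y", "winner": "model_a"},
        {"model_a": "model-y", "model_b": "model-z", "winner": "model_b"},
        {"model_a": "model-x", "model_b": "model-z", "winner": "tie"},
    ]
)
check_battle_columns(battles)
```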
Outputs
| Name | Type | Description |
|---|---|---|
| elo_results | dict | Dictionary containing "elo_rating" (dict of model ratings), "bootstrap_ci" (confidence intervals), "win_rate_matrix" (pairwise win rates), and "num_battles" (per-model battle counts) |
| arena_table | pd.DataFrame | Merged leaderboard table with Elo ratings, metadata, and rankings |
| figures | matplotlib figures | Heatmaps and bar charts generated during analysis |
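The merged leaderboard table can be pictured as a left join of ratings onto metadata followed by a rank column. The sketch below uses plain pandas with invented data and hypothetical column names (`model`, `rating`, `organization`, `license`); it is not the body of get_arena_table.

```python
import pandas as pd

# Invented ratings and metadata; column names are assumptions.
ratings = pd.DataFrame(
    {"model": ["model-x", "model-y", "model-z"], "rating": [1012.5, 987.5, 1000.0]}
)
metadata = pd.DataFrame(
    {
        "model": ["model-x", "model-y"],
        "organization": ["Org A", "Org B"],
        "license": ["Apache-2.0", "Proprietary"],
    }
)

# A left join keeps rated models even when metadata is missing for them.
table = (
    ratings.merge(metadata, on="model", how="left")
    .sort_values("rating", ascending=False)
    .reset_index(drop=True)
)
# Rank 1 is the highest rating; tied ratings share the smaller rank.
table["rank"] = table["rating"].rank(ascending=False, method="min").astype(int)
```

A left join is used here so that a model with ratings but no metadata row still appears on the leaderboard, with empty metadata fields.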
Usage Examples
import pandas as pd

from fastchat.serve.monitor.clean_battle_data import clean_battle_data
from fastchat.serve.monitor.elo_analysis import (
    compute_elo_results,
    report_elo_analysis_results,
    get_arena_table,
    pretty_print_elo_rating,
)

# Load and clean battle data
battles_df = clean_battle_data(["logs/battles_2024_01.json"])

# Compute basic Elo ratings
elo_results = compute_elo_results(battles_df, K=4.0)
print(pretty_print_elo_rating(elo_results["elo_rating"]))

# Run full analysis with bootstrap CIs and visualizations
full_results = report_elo_analysis_results(battles_df)
print(f"Number of models: {len(full_results['elo_rating'])}")
print(f"Bootstrap CI entries: {len(full_results['bootstrap_ci'])}")

# Generate the leaderboard table from the ratings and a metadata CSV
model_table_df = pd.read_csv("model_metadata.csv")
arena_table = get_arena_table(
    # list(...) makes the (model, rating) pairs a concrete sequence for pandas
    pd.DataFrame(list(full_results["elo_rating"].items()), columns=["model", "rating"]),
    model_table_df,
)
print(arena_table.head(20))
Related Pages
- Implements: Principle:Lm_sys_FastChat_Elo_Rating_Analysis
- Environment:Lm_sys_FastChat_GPU_CUDA_Inference
- Lm_sys_FastChat_Clean_Battle_Data - Produces the cleaned battle data consumed by this module
- Lm_sys_FastChat_Rating_Systems - Alternative rating system implementations
- Lm_sys_FastChat_Monitor_Dashboard - Displays Elo analysis results in the Gradio UI
- Lm_sys_FastChat_Category_Classifier - Provides category labels for per-category Elo analysis