Implementation:Lm sys FastChat Elo Analysis

Knowledge Sources	Lm_sys_FastChat Chatbot Arena
Domains	Model_Evaluation, Statistics
Last Updated	2026-02-07 06:00 GMT

Overview

elo_analysis computes Elo ratings from arena battle data and generates comprehensive visualizations including bootstrap confidence intervals, win-rate heatmaps, and ranking bar charts.

Description

The elo_analysis.py module is the central analytics engine of the Chatbot Arena leaderboard. It takes cleaned battle data and produces an Elo rating for every participating model, along with statistical confidence measures and publication-quality visualizations. It is the step that turns raw pairwise comparison data into the leaderboard rankings displayed to users.

The core function compute_elo_results runs the Elo rating algorithm over a battles DataFrame, with a configurable K-factor, scale, base, and initial rating. It returns a dictionary containing final ratings, per-model statistics, and intermediate computation artifacts. To quantify uncertainty, the module applies bootstrap resampling: it recomputes Elo ratings on many random resamples of the battle data (drawn with replacement) and derives a confidence interval for each model's rating from the spread of the results.
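The Elo update and the bootstrap step described above can be sketched in a few lines. This is a minimal illustration, not the module's code: it works on plain (model_a, model_b, winner) tuples standing in for DataFrame rows, and the function names elo_sketch and bootstrap_ci_sketch, the winner labels "model_a" and "model_b", the treatment of any other label as a tie, and the 95% percentile interval are all assumptions made for the example.

```python
import random
from collections import defaultdict


def elo_sketch(battles, K=4.0, SCALE=400.0, BASE=10.0, INIT_RATING=1000.0):
    """Sequential Elo over (model_a, model_b, winner) rows (hypothetical helper)."""
    rating = defaultdict(lambda: INIT_RATING)
    for model_a, model_b, winner in battles:
        ra, rb = rating[model_a], rating[model_b]
        # Expected score of model_a under the logistic Elo model.
        ea = 1.0 / (1.0 + BASE ** ((rb - ra) / SCALE))
        # Actual score: 1 for a win, 0 for a loss, 0.5 for anything else (assumed tie).
        if winner == "model_a":
            sa = 1.0
        elif winner == "model_b":
            sa = 0.0
        else:
            sa = 0.5
        # Zero-sum update: what model_a gains, model_b loses.
        rating[model_a] = ra + K * (sa - ea)
        rating[model_b] = rb - K * (sa - ea)
    return dict(rating)


def bootstrap_ci_sketch(battles, num_rounds=100, seed=0, **elo_kwargs):
    """Percentile interval per model from Elo runs on resampled battles."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(num_rounds):
        # Resample with replacement, keeping the original number of battles.
        resample = rng.choices(battles, k=len(battles))
        for model, value in elo_sketch(resample, **elo_kwargs).items():
            samples[model].append(value)
    ci = {}
    for model, values in samples.items():
        values.sort()
        lower = values[int(0.025 * (len(values) - 1))]
        upper = values[int(0.975 * (len(values) - 1))]
        ci[model] = (lower, upper)
    return ci
```

Because the update is sequential, the final ratings depend on battle order, which is one reason resampling is useful for judging how stable a ranking is. A model that is missing from a given resample simply contributes no value for that round in this sketch.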

The report_elo_analysis_results function orchestrates the full analysis workflow: it computes ratings, generates bootstrap confidence intervals, builds win-rate matrices between all model pairs, and produces visualizations. Output includes matplotlib/seaborn heatmaps showing head-to-head win rates and horizontal bar charts showing model rankings with confidence intervals. The get_arena_table function merges Elo ratings with model metadata to produce the final leaderboard table, and pretty_print_elo_rating formats ratings for display.
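As a rough illustration of the win-rate matrix step, the sketch below counts decisive battles for each ordered model pair and divides wins by games. It is not the module's implementation: the helper name win_rate_matrix_sketch, the tuple input format, the winner labels "model_a" and "model_b", and the choice to skip ties are assumptions for the example.

```python
from collections import defaultdict


def win_rate_matrix_sketch(battles):
    """Return {row: {col: share of decisive battles that row won against col}}."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for model_a, model_b, winner in battles:
        if winner not in ("model_a", "model_b"):
            continue  # ties are excluded in this sketch
        if winner == "model_a":
            victor, loser = model_a, model_b
        else:
            victor, loser = model_b, model_a
        wins[(victor, loser)] += 1
        # Count the game for both orderings so each cell has its own denominator.
        games[(victor, loser)] += 1
        games[(loser, victor)] += 1
    models = sorted({m for pair in games for m in pair})
    return {
        a: {b: wins[(a, b)] / games[(a, b)] for b in models if games.get((a, b))}
        for a in models
    }
```

The nested dictionary has one row per model and omits pairs that never met. A structure like this can be converted to a pandas DataFrame and passed to a heatmap routine such as seaborn.heatmap, which is the kind of plot the paragraph above describes.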

Usage

Use this module after cleaning battle data with clean_battle_data. It is invoked both by the monitor dashboard for real-time leaderboard updates and by offline analysis scripts for generating research reports and blog post figures. The output can be serialized to JSON for caching and served by the Gradio dashboard.
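Caching the results as JSON might look like the sketch below. It assumes the dictionary holds only JSON-compatible values (strings, numbers, lists, nested dicts); DataFrames and matplotlib figures would need to be converted or dropped first. The helper names save_results_json and load_results_json are hypothetical and not part of FastChat.

```python
import json
import os
import tempfile


def save_results_json(results, path):
    """Write a JSON-compatible dict to path via a temp file and rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=2, sort_keys=True)
        # Replace in one step so a reader never sees a half-written file.
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise


def load_results_json(path):
    """Read a cached results dict back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Writing to a temporary file and renaming it means a dashboard process that reloads the cache while an analysis job is writing will see either the old file or the new one. Note that tuples, such as (lower, upper) interval pairs, come back as lists after a JSON round trip.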

Code Reference

Source Location

Repository: Lm_sys_FastChat
File: fastchat/serve/monitor/elo_analysis.py
Lines: 1-549

Signature

import pandas as pd

def compute_elo_results(
    battles: pd.DataFrame,
    K: float = 4.0,
    SCALE: float = 400.0,
    BASE: float = 10.0,
    INIT_RATING: float = 1000.0,
) -> dict:
    """Compute Elo ratings from a battles DataFrame."""
    ...

def report_elo_analysis_results(
    battles_df: pd.DataFrame,
) -> dict:
    """Run full Elo analysis with bootstrap CIs and visualizations."""
    ...

def get_arena_table(
    arena_df: pd.DataFrame,
    model_table_df: pd.DataFrame,
) -> pd.DataFrame:
    """Merge Elo ratings with model metadata for leaderboard display."""
    ...

def pretty_print_elo_rating(rating: dict) -> str:
    """Format Elo ratings for human-readable display."""
    ...

Import

from fastchat.serve.monitor.elo_analysis import report_elo_analysis_results

I/O Contract

Inputs

Name	Type	Required	Description
battles	`pd.DataFrame`	Yes	Cleaned battle records with columns: `model_a`, `model_b`, `winner`
K	`float`	No	Elo K-factor controlling rating sensitivity (default: `4.0`)
SCALE	`float`	No	Elo scale parameter (default: `400.0`)
BASE	`float`	No	Elo base parameter (default: `10.0`)
INIT_RATING	`float`	No	Initial Elo rating for new models (default: `1000.0`)
arena_df	`pd.DataFrame`	Yes (for get_arena_table)	DataFrame of Elo ratings and statistics per model
model_table_df	`pd.DataFrame`	Yes (for get_arena_table)	Model metadata table with organization, license, and link info

Outputs

Name	Type	Description
elo_results	`dict`	Dictionary containing `"elo_rating"` (dict of model ratings), `"bootstrap_ci"` (confidence intervals), `"win_rate_matrix"` (pairwise win rates), and `"num_battles"` (per-model battle counts)
arena_table	`pd.DataFrame`	Merged leaderboard table with Elo ratings, metadata, and rankings
figures	matplotlib figures	Heatmaps and bar charts generated during analysis

Usage Examples

import pandas as pd

from fastchat.serve.monitor.clean_battle_data import clean_battle_data
from fastchat.serve.monitor.elo_analysis import (
    compute_elo_results,
    report_elo_analysis_results,
    get_arena_table,
    pretty_print_elo_rating,
)

# Load and clean battle data
battles_df = clean_battle_data(["logs/battles_2024_01.json"])

# Compute basic Elo ratings with the default K-factor
elo_results = compute_elo_results(battles_df, K=4.0)
print(pretty_print_elo_rating(elo_results["elo_rating"]))

# Run full analysis with bootstrap CIs and visualizations
full_results = report_elo_analysis_results(battles_df)
print(f"Number of models: {len(full_results['elo_rating'])}")
print(f"Bootstrap CI entries: {len(full_results['bootstrap_ci'])}")

# Generate the leaderboard table from the ratings and a metadata CSV
model_table_df = pd.read_csv("model_metadata.csv")
ratings_df = pd.DataFrame(
    list(full_results["elo_rating"].items()), columns=["model", "rating"]
)
arena_table = get_arena_table(ratings_df, model_table_df)
print(arena_table.head(20))

Related Pages

Implements: Principle:Lm_sys_FastChat_Elo_Rating_Analysis
Environment:Lm_sys_FastChat_GPU_CUDA_Inference
Lm_sys_FastChat_Clean_Battle_Data - Produces the cleaned battle data consumed by this module
Lm_sys_FastChat_Rating_Systems - Alternative rating system implementations
Lm_sys_FastChat_Monitor_Dashboard - Displays Elo analysis results in the Gradio UI
Lm_sys_FastChat_Category_Classifier - Provides category labels for per-category Elo analysis
