Implementation:Lm sys FastChat Elo Analysis

From Leeroopedia


Knowledge Sources
Domains Model_Evaluation, Statistics
Last Updated 2026-02-07 06:00 GMT

Overview

elo_analysis computes Elo ratings from arena battle data and produces bootstrap confidence intervals, win-rate heatmaps, and ranking bar charts.

Description

The elo_analysis.py module is the analytics core of the Chatbot Arena leaderboard. It takes cleaned battle data and produces an Elo rating for every participating model, along with confidence measures and visualizations. In other words, it turns raw pairwise comparison data into the rankings displayed on the leaderboard.

The core function compute_elo_results runs the Elo rating algorithm over a battles DataFrame, with configurable K-factor (K), scale (SCALE), base (BASE), and initial rating (INIT_RATING). It returns a dictionary containing final ratings, per-model statistics, and intermediate computation artifacts. To quantify uncertainty, the module uses bootstrap resampling: it repeats the Elo computation on randomly resampled battle data (bootstrap resampling conventionally draws rows with replacement) and derives a confidence interval for each model's rating from the spread of the resulting ratings.
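The update rule and the bootstrap step described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper names sketch_compute_elo and sketch_bootstrap_ci are hypothetical, the winner column is assumed to hold the strings "model_a" or "model_b" (any other value is treated as a tie), and the 95% percentile interval is an assumed choice.

```python
from collections import defaultdict

import numpy as np
import pandas as pd


def sketch_compute_elo(battles, K=4.0, SCALE=400.0, BASE=10.0, INIT_RATING=1000.0):
    """Online Elo over battles in row order (hypothetical helper)."""
    rating = defaultdict(lambda: INIT_RATING)
    rows = battles[["model_a", "model_b", "winner"]].itertuples(index=False)
    for model_a, model_b, winner in rows:
        ra, rb = rating[model_a], rating[model_b]
        # Expected score of model_a under the logistic Elo model.
        ea = 1.0 / (1.0 + BASE ** ((rb - ra) / SCALE))
        # Actual score of model_a: 1 for a win, 0 for a loss, 0.5 otherwise.
        # Assumption: winner is "model_a" or "model_b"; anything else is a tie.
        if winner == "model_a":
            sa = 1.0
        elif winner == "model_b":
            sa = 0.0
        else:
            sa = 0.5
        delta = K * (sa - ea)
        rating[model_a] = ra + delta
        rating[model_b] = rb - delta
    return dict(rating)


def sketch_bootstrap_ci(battles, num_rounds=100, seed=0):
    """Percentile intervals from Elo ratings on resampled battles (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(battles)
    samples = []
    for _ in range(num_rounds):
        # Draw n row positions with replacement, then recompute the ratings.
        idx = rng.integers(0, n, size=n)
        samples.append(sketch_compute_elo(battles.iloc[idx]))
    # One row per round, one column per model; transpose so models are rows.
    return pd.DataFrame(samples).quantile([0.025, 0.5, 0.975]).T
```

Because each update adds to one rating exactly what it subtracts from the other, the sum of all ratings stays constant. Online Elo also depends on row order, which is one reason resampling the battles yields a spread of ratings rather than a single value.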

The report_elo_analysis_results function orchestrates the full analysis workflow: it computes ratings, generates bootstrap confidence intervals, builds win-rate matrices between all model pairs, and produces visualizations. Output includes matplotlib/seaborn heatmaps showing head-to-head win rates and horizontal bar charts showing model rankings with confidence intervals. The get_arena_table function merges Elo ratings with model metadata to produce the final leaderboard table, and pretty_print_elo_rating formats ratings for display.
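One way to build a pairwise win-rate matrix like the one described above is sketched here. The helper name sketch_win_rate_matrix is hypothetical, the winner column is assumed to hold "model_a" or "model_b" for decisive battles, and dropping ties before counting is an assumed convention rather than confirmed behavior of the module.

```python
import pandas as pd


def sketch_win_rate_matrix(battles):
    """Fraction of decisive battles the row model won against the column model."""
    # Assumption: rows whose winner is neither "model_a" nor "model_b" are ties.
    decisive = battles[battles["winner"].isin(["model_a", "model_b"])]
    a_won = decisive["winner"] == "model_a"
    winners = decisive["model_a"].where(a_won, decisive["model_b"]).rename("winner")
    losers = decisive["model_b"].where(a_won, decisive["model_a"]).rename("loser")

    # wins.loc[i, j] counts how often model i beat model j.
    wins = pd.crosstab(winners, losers)
    models = sorted(set(decisive["model_a"]) | set(decisive["model_b"]))
    wins = wins.reindex(index=models, columns=models, fill_value=0)

    # Decisive battles between i and j, counted in either direction.
    total = wins + wins.T
    # Pairs that never met (and the diagonal) come out as NaN.
    return wins / total
```

The resulting DataFrame is square with models on both axes, so it can be passed directly to a heatmap function such as seaborn.heatmap.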

Usage

Use this module after cleaning battle data with clean_battle_data. It is invoked both by the monitor dashboard for real-time leaderboard updates and by offline analysis scripts for generating research reports and blog post figures. The output can be serialized to JSON for caching and served by the Gradio dashboard.

Code Reference

Source Location

Signature

import pandas as pd

def compute_elo_results(
    battles: pd.DataFrame,
    K: float = 4.0,
    SCALE: float = 400.0,
    BASE: float = 10.0,
    INIT_RATING: float = 1000.0,
) -> dict:
    """Compute Elo ratings from a battles DataFrame."""
    ...

def report_elo_analysis_results(
    battles_df: pd.DataFrame,
) -> dict:
    """Run full Elo analysis with bootstrap CIs and visualizations."""
    ...

def get_arena_table(
    arena_df: pd.DataFrame,
    model_table_df: pd.DataFrame,
) -> pd.DataFrame:
    """Merge Elo ratings with model metadata for leaderboard display."""
    ...

def pretty_print_elo_rating(rating: dict) -> str:
    """Format Elo ratings for human-readable display."""
    ...

Import

from fastchat.serve.monitor.elo_analysis import report_elo_analysis_results

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| battles | pd.DataFrame | Yes | Cleaned battle records with columns: model_a, model_b, winner |
| K | float | No | Elo K-factor controlling rating sensitivity (default: 4.0) |
| SCALE | float | No | Elo scale parameter (default: 400.0) |
| BASE | float | No | Elo base parameter (default: 10.0) |
| INIT_RATING | float | No | Initial Elo rating for new models (default: 1000.0) |
| arena_df | pd.DataFrame | Yes (for get_arena_table) | DataFrame of Elo ratings and statistics per model |
| model_table_df | pd.DataFrame | Yes (for get_arena_table) | Model metadata table with organization, license, and link info |

Outputs

| Name | Type | Description |
|---|---|---|
| elo_results | dict | Dictionary containing "elo_rating" (dict of model ratings), "bootstrap_ci" (confidence intervals), "win_rate_matrix" (pairwise win rates), and "num_battles" (per-model battle counts) |
| arena_table | pd.DataFrame | Merged leaderboard table with Elo ratings, metadata, and rankings |
| figures | matplotlib figures | Heatmaps and bar charts generated during analysis |
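The merge that produces a leaderboard table like arena_table can be illustrated with plain pandas. This is a sketch under assumptions: the helper name sketch_arena_table, the join column "model", and the added "rank" column are all hypothetical, and the real get_arena_table may use different column names and ranking logic.

```python
import pandas as pd


def sketch_arena_table(rating, model_table_df, key="model"):
    """Rank models by rating and attach metadata (hypothetical helper)."""
    # Highest rating first, so the row position gives the rank.
    ordered = sorted(rating.items(), key=lambda kv: kv[1], reverse=True)
    arena_df = pd.DataFrame(ordered, columns=[key, "rating"])
    arena_df.insert(0, "rank", list(range(1, len(arena_df) + 1)))
    # A left join keeps every rated model even if its metadata row is missing.
    return arena_df.merge(model_table_df, on=key, how="left")
```

A left join is used so that a model with a rating but no metadata entry still appears in the table, with missing values in the metadata columns.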

Usage Examples

import pandas as pd

from fastchat.serve.monitor.clean_battle_data import clean_battle_data
from fastchat.serve.monitor.elo_analysis import (
    compute_elo_results,
    report_elo_analysis_results,
    get_arena_table,
    pretty_print_elo_rating,
)

# Load and clean battle data
battles_df = clean_battle_data(["logs/battles_2024_01.json"])

# Compute basic Elo ratings
elo_results = compute_elo_results(battles_df, K=4.0)
print(pretty_print_elo_rating(elo_results["elo_rating"]))

# Run full analysis with bootstrap CIs and visualizations
full_results = report_elo_analysis_results(battles_df)
print(f"Number of models: {len(full_results['elo_rating'])}")
# "bootstrap_ci" holds confidence intervals, so this counts CI entries,
# not the number of bootstrap rounds
print(f"Bootstrap CI entries: {len(full_results['bootstrap_ci'])}")

# Generate the leaderboard table
model_table_df = pd.read_csv("model_metadata.csv")
arena_table = get_arena_table(
    pd.DataFrame(full_results["elo_rating"].items(), columns=["model", "rating"]),
    model_table_df,
)
print(arena_table.head(20))
