Implementation:Lm sys FastChat Rating Systems

Knowledge Sources	Lm_sys_FastChat Chatbot Arena
Domains	Model_Evaluation, Statistics
Last Updated	2026-02-07 06:00 GMT

Overview

rating_systems.py implements multiple statistical rating systems -- including Elo, Bradley-Terry, and bootstrapped variants -- for computing model strength rankings from pairwise arena battle outcomes.

Description

The rating_systems.py module provides a collection of well-established statistical methods for deriving cardinal ratings from pairwise comparison data. These rating systems form the mathematical backbone of the Chatbot Arena leaderboard, translating raw "model A vs. model B" battle outcomes into numerical ratings that quantify relative model strength.

The classic Elo rating system, implemented in compute_elo, processes battles sequentially and updates ratings using a logistic expected-score formula: model A's expected score against model B is 1 / (1 + BASE^((R_B - R_A) / SCALE)), and each rating moves by K times the difference between the actual and the expected score. It is parameterized by the K-factor (controlling update magnitude), scale, base, and initial rating. While simple and interpretable, sequential Elo is sensitive to battle ordering: the same battles processed in a different order generally yield different ratings. The Bradley-Terry model, implemented in compute_bt, addresses this by fitting a maximum likelihood estimate over the full battle dataset at once, producing ratings that are invariant to battle ordering.
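The sequential update and its order sensitivity can be sketched in a few lines. This is a minimal illustration, not the module's code: it assumes battles are plain (model_a, model_b, winner) tuples rather than a DataFrame, the name sequential_elo is hypothetical, and any winner value other than "model_a" or "model_b" is treated as a tie.

```python
from collections import defaultdict


def sequential_elo(battles, k=4.0, scale=400.0, base=10.0, init_rating=1000.0):
    """Illustrative sequential Elo over (model_a, model_b, winner) tuples."""
    ratings = defaultdict(lambda: init_rating)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Logistic expected score of model_a against model_b.
        expected_a = 1.0 / (1.0 + base ** ((rb - ra) / scale))
        # Actual score: 1 for a win, 0 for a loss, 0.5 for anything else (tie).
        score_a = {"model_a": 1.0, "model_b": 0.0}.get(winner, 0.5)
        # Zero-sum update: whatever model_a gains, model_b loses.
        delta = k * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


# Toy data (made up for illustration).
battles = [
    ("alpha", "beta", "model_a"),
    ("beta", "gamma", "model_a"),
    ("alpha", "gamma", "tie"),
]

forward = sequential_elo(battles)
backward = sequential_elo(list(reversed(battles)))

# Same battles, different order: the ratings do not match.
for model in sorted(forward):
    print(f"{model}: forward={forward[model]:.3f} backward={backward[model]:.3f}")
```

Because each update depends on the ratings at that moment, reversing the battle list changes the result even though the set of outcomes is identical.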

To quantify uncertainty in the computed ratings, compute_bootstrap_elo performs bootstrap resampling: it repeatedly draws random samples (with replacement) from the battle dataset and computes Elo ratings on each sample, producing a distribution of ratings for every model. The resulting confidence intervals are critical for determining whether apparent rating differences between models are statistically significant. Additional rating computation methods provide variants and extensions tailored to specific analysis needs, such as handling ties, controlling for prompt difficulty, or computing ratings within specific time windows.
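The resampling procedure described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module's implementation: battles are plain tuples instead of a DataFrame, the helper names (elo, bootstrap_elo, percentile, confidence_intervals) are hypothetical, and the percentile uses a simple index-based rule rather than interpolation.

```python
import random


def elo(battles, k=4.0, scale=400.0, base=10.0, init_rating=1000.0):
    """Compact sequential Elo used as the statistic being bootstrapped."""
    ratings = {}
    for a, b, winner in battles:
        ra = ratings.get(a, init_rating)
        rb = ratings.get(b, init_rating)
        expected_a = 1.0 / (1.0 + base ** ((rb - ra) / scale))
        score_a = {"model_a": 1.0, "model_b": 0.0}.get(winner, 0.5)
        delta = k * (score_a - expected_a)
        ratings[a], ratings[b] = ra + delta, rb - delta
    return ratings


def bootstrap_elo(battles, num_round=200, seed=0):
    """Recompute Elo on num_round resamples drawn with replacement."""
    rng = random.Random(seed)
    return [elo(rng.choices(battles, k=len(battles))) for _ in range(num_round)]


def percentile(values, q):
    """Index-based percentile of a non-empty list, q in [0, 1]."""
    ordered = sorted(values)
    return ordered[round(q * (len(ordered) - 1))]


def confidence_intervals(rounds, lower=0.025, upper=0.975):
    """Return {model: (lower, median, upper)} across bootstrap rounds."""
    per_model = {}
    for ratings in rounds:
        for model, rating in ratings.items():
            per_model.setdefault(model, []).append(rating)
    return {
        m: (percentile(v, lower), percentile(v, 0.5), percentile(v, upper))
        for m, v in per_model.items()
    }


# Toy data (made up): alpha wins 30, loses 10, and ties 5 against beta.
battles = (
    [("alpha", "beta", "model_a")] * 30
    + [("alpha", "beta", "model_b")] * 10
    + [("alpha", "beta", "tie")] * 5
)

rounds = bootstrap_elo(battles)
intervals = confidence_intervals(rounds)
for model, (low, mid, high) in sorted(intervals.items()):
    print(f"{model}: {mid:.1f} [{low:.1f}, {high:.1f}]")
```

If two models' intervals overlap heavily, the data do not clearly separate them; non-overlapping intervals suggest the rating gap is unlikely to be a resampling artifact.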

Usage

Use this module when you need to compute model ratings from pairwise battle data. Choose compute_elo for quick, sequential computation; compute_bt for statistically rigorous maximum-likelihood ratings; and compute_bootstrap_elo when you need confidence intervals. These functions are typically called by elo_analysis.py rather than invoked directly, but they can be used independently for custom analyses.
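To make the difference between the sequential and maximum-likelihood approaches concrete, the following sketch fits a Bradley-Terry model with the classic minorization-maximization (MM) iteration and maps the strengths onto an Elo-like scale. It is an assumption-laden illustration: the optimizer used by compute_bt may differ, the function name bradley_terry_mm is hypothetical, ties are counted as half a win for each side, and the iteration assumes every model has at least one win and one loss.

```python
import math
import random
from collections import defaultdict


def bradley_terry_mm(battles, iters=200, scale=400.0, base=10.0, init_rating=1000.0):
    """Illustrative Bradley-Terry fit over (model_a, model_b, winner) tuples."""
    wins = defaultdict(float)   # wins[i]: total score of model i (tie = 0.5)
    games = defaultdict(float)  # games[(i, j)]: comparisons between i and j
    for a, b, winner in battles:
        score_a = {"model_a": 1.0, "model_b": 0.0}.get(winner, 0.5)
        wins[a] += score_a
        wins[b] += 1.0 - score_a
        games[(a, b)] += 1.0
        games[(b, a)] += 1.0

    models = sorted(wins)
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            # MM update: p_i = W_i / sum_j n_ij / (p_i + p_j).
            denom = sum(
                games[(i, j)] / (strength[i] + strength[j])
                for j in models
                if j != i and (i, j) in games
            )
            updated[i] = wins[i] / denom
        # Fix the scale by normalizing the geometric mean to 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}

    # Map strengths to an Elo-like scale centered on init_rating.
    return {m: init_rating + scale * math.log(strength[m], base) for m in models}


# Toy data (made up): each pair plays four times, with a 3-1 split.
battles = (
    [("alpha", "beta", "model_a")] * 3 + [("alpha", "beta", "model_b")]
    + [("beta", "gamma", "model_a")] * 3 + [("beta", "gamma", "model_b")]
    + [("gamma", "alpha", "model_b")] * 3 + [("gamma", "alpha", "model_a")]
)

ratings = bradley_terry_mm(battles)

# Shuffling the battles leaves the fit unchanged, unlike sequential Elo.
shuffled = battles[:]
random.Random(0).shuffle(shuffled)
ratings_shuffled = bradley_terry_mm(shuffled)

for model in sorted(ratings, key=ratings.get, reverse=True):
    print(f"{model}: {ratings[model]:.1f}")
```

The fit depends only on aggregate win counts and pair counts, which is why the order of the input records has no effect on the result.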

Code Reference

Source Location

Repository: Lm_sys_FastChat
File: fastchat/serve/monitor/rating_systems.py
Lines: 1-385

Signature

def compute_elo(
    battles: pd.DataFrame,
    K: float = 4.0,
    SCALE: float = 400.0,
    BASE: float = 10.0,
    INIT_RATING: float = 1000.0,
) -> dict[str, float]:
    """Compute sequential Elo ratings from battle outcomes.

    Args:
        battles: DataFrame with columns model_a, model_b, winner.
        K: K-factor controlling rating update magnitude.
        SCALE: Logistic scale parameter.
        BASE: Logistic base parameter.
        INIT_RATING: Starting rating for all models.

    Returns:
        Dictionary mapping model names to their Elo ratings.
    """
    ...

def compute_bt(
    battles: pd.DataFrame,
) -> dict[str, float]:
    """Compute Bradley-Terry maximum-likelihood ratings.

    Args:
        battles: DataFrame with columns model_a, model_b, winner.

    Returns:
        Dictionary mapping model names to their Bradley-Terry ratings.
    """
    ...

def compute_bootstrap_elo(
    battles: pd.DataFrame,
    num_round: int = 1000,
    K: float = 4.0,
) -> pd.DataFrame:
    """Compute bootstrapped Elo ratings for confidence intervals.

    Args:
        battles: DataFrame with columns model_a, model_b, winner.
        num_round: Number of bootstrap resampling rounds.
        K: K-factor for each Elo computation.

    Returns:
        DataFrame with shape (num_round, num_models) of per-round ratings.
    """
    ...

Import

from fastchat.serve.monitor.rating_systems import (
    compute_elo,
    compute_bt,
    compute_bootstrap_elo,
)

I/O Contract

Inputs

Name	Type	Required	Description
battles	`pd.DataFrame`	Yes	Battle records with columns: `model_a`, `model_b`, `winner` (values: `"model_a"`, `"model_b"`, or `"tie"`)
K	`float`	No	Elo K-factor controlling update sensitivity (default: `4.0`)
SCALE	`float`	No	Logistic scale parameter for expected score computation (default: `400.0`)
BASE	`float`	No	Logistic base parameter (default: `10.0`)
INIT_RATING	`float`	No	Initial rating assigned to all models (default: `1000.0`)
num_round	`int`	No	Number of bootstrap resampling rounds (default: `1000`)
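A small pre-flight check can catch malformed input before it reaches the rating functions. The sketch below builds a toy battles frame matching the contract in the table above and validates it; validate_battles is a hypothetical helper and not part of rating_systems.py, and the allowed winner values are taken from the table rather than from the module's source.

```python
import pandas as pd

REQUIRED_COLUMNS = ("model_a", "model_b", "winner")
ALLOWED_WINNERS = {"model_a", "model_b", "tie"}


def validate_battles(battles: pd.DataFrame) -> pd.DataFrame:
    """Raise ValueError if the frame does not match the documented contract."""
    missing = [c for c in REQUIRED_COLUMNS if c not in battles.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    unexpected = set(battles["winner"].unique()) - ALLOWED_WINNERS
    if unexpected:
        raise ValueError(f"unexpected winner values: {sorted(unexpected)}")
    return battles


# Toy records (made up for illustration).
battles = pd.DataFrame(
    [
        {"model_a": "alpha", "model_b": "beta", "winner": "model_a"},
        {"model_a": "beta", "model_b": "gamma", "winner": "tie"},
        {"model_a": "gamma", "model_b": "alpha", "winner": "model_b"},
    ]
)

validate_battles(battles)
print(battles["winner"].value_counts().to_dict())
```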

Outputs

Name	Type	Description
elo_ratings	`dict[str, float]`	Dictionary mapping each model name to its computed Elo rating
bt_ratings	`dict[str, float]`	Dictionary mapping each model name to its Bradley-Terry rating
bootstrap_df	`pd.DataFrame`	DataFrame of shape `(num_round, num_models)` containing per-round Elo ratings for computing confidence intervals

Usage Examples

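import pandas as pd
from fastchat.serve.monitor.clean_battle_data import clean_battle_data
from fastchat.serve.monitor.rating_systems import (
    compute_elo,
    compute_bt,
    compute_bootstrap_elo,
)

# Load cleaned battle data.
# Wrapping in pd.DataFrame is harmless if a DataFrame is already returned
# and converts the result if it is a list of records (an assumption here).
battles = pd.DataFrame(clean_battle_data(["logs/battles.json"]))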

# Compute sequential Elo ratings
elo_ratings = compute_elo(battles, K=4.0, INIT_RATING=1000.0)
for model, rating in sorted(elo_ratings.items(), key=lambda x: -x[1])[:10]:
    print(f"  {model}: {rating:.1f}")

# Compute Bradley-Terry ratings
bt_ratings = compute_bt(battles)
for model, rating in sorted(bt_ratings.items(), key=lambda x: -x[1])[:10]:
    print(f"  {model}: {rating:.1f}")

# Compute bootstrap confidence intervals
bootstrap_df = compute_bootstrap_elo(battles, num_round=1000, K=4.0)
ci_lower = bootstrap_df.quantile(0.025)
ci_upper = bootstrap_df.quantile(0.975)
median = bootstrap_df.median()

for model in median.sort_values(ascending=False).index[:10]:
    print(f"  {model}: {median[model]:.1f} [{ci_lower[model]:.1f}, {ci_upper[model]:.1f}]")

Related Pages

Implements: Principle:Lm_sys_FastChat_Pairwise_Rating_Computation
Environment:Lm_sys_FastChat_GPU_CUDA_Inference
Lm_sys_FastChat_Elo_Analysis - Higher-level analysis module that orchestrates rating computation
Lm_sys_FastChat_Clean_Battle_Data - Produces the cleaned battle data consumed by rating functions
Lm_sys_FastChat_Monitor_Dashboard - Displays computed ratings in the leaderboard UI
