

Implementation:Lm sys FastChat Rating Systems

From Leeroopedia


Knowledge Sources
Domains: Model_Evaluation, Statistics
Last Updated: 2026-02-07 06:00 GMT

Overview

rating_systems.py implements several statistical rating systems, including Elo, Bradley-Terry, and bootstrapped variants, for computing model strength rankings from pairwise arena battle outcomes.

Description

The rating_systems.py module provides a collection of well-established statistical methods for deriving cardinal ratings from pairwise comparison data. These rating systems form the mathematical backbone of the Chatbot Arena leaderboard, translating raw "model A vs. model B" battle outcomes into numerical ratings that quantify relative model strength.

The classic Elo rating system, implemented in compute_elo, processes battles sequentially. For each battle it computes a logistic expected score from the two models' current ratings and moves both ratings toward the observed outcome. It is parameterized by the K-factor (the maximum rating change per battle), the logistic scale and base, and the initial rating. Sequential Elo is simple and interpretable, but the final ratings depend on the order in which battles are processed. The Bradley-Terry model, implemented in compute_bt, avoids this by fitting all ratings at once via maximum likelihood estimation over the full battle dataset, so the result does not depend on battle order.
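The sequential update described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the module's code: the function name sequential_elo, the tuple-based battle format, and the rule that a tie scores 0.5 for each side are assumptions made for the example. Running the same battles in reverse order shows the ordering sensitivity.

```python
def sequential_elo(battles, k=4.0, scale=400.0, base=10.0, init_rating=1000.0):
    """Sequential Elo over an ordered list of (model_a, model_b, winner) tuples.

    Hypothetical helper for illustration; not the FastChat implementation.
    """
    ratings = {}
    for model_a, model_b, winner in battles:
        ra = ratings.setdefault(model_a, init_rating)
        rb = ratings.setdefault(model_b, init_rating)
        # Logistic expected score of model_a against model_b.
        expected_a = 1.0 / (1.0 + base ** ((rb - ra) / scale))
        # Observed score: 1 for a win, 0 for a loss, 0.5 for anything else (assumed tie).
        score_a = {"model_a": 1.0, "model_b": 0.0}.get(winner, 0.5)
        # The update is zero-sum: what model_a gains, model_b loses.
        delta = k * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return ratings


battles = [
    ("alpha", "beta", "model_a"),
    ("beta", "gamma", "model_a"),
    ("alpha", "gamma", "tie"),
]
forward = sequential_elo(battles)
backward = sequential_elo(list(reversed(battles)))

# Same battles, different processing order, slightly different ratings.
for model in sorted(forward):
    print(f"{model}: forward={forward[model]:.3f} backward={backward[model]:.3f}")
```

Because each update depends on the ratings at that moment, early battles are scored against near-initial ratings while later ones are scored against ratings that have already moved, which is where the order dependence comes from.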

To quantify uncertainty in the computed ratings, compute_bootstrap_elo performs bootstrap resampling: it repeatedly draws battles at random with replacement from the dataset, computes Elo ratings on each resample, and so produces a distribution of ratings for every model. Confidence intervals taken from that distribution indicate whether an apparent rating difference between two models is larger than the resampling noise. Additional rating computation methods provide variants and extensions for specific analysis needs, such as handling ties, controlling for prompt difficulty, or computing ratings within specific time windows.
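As a rough illustration of the resampling idea, the sketch below bootstraps a small synthetic dataset using only the standard library. The helper names (elo, bootstrap_elo, percentile_interval), the synthetic battles, the fixed seed, and the choice to make each resample the same size as the original dataset are all assumptions for the example and may differ from what compute_bootstrap_elo does.

```python
import random


def elo(battles, k=4.0, scale=400.0, base=10.0, init_rating=1000.0):
    """Minimal sequential Elo over (model_a, model_b, winner) tuples (illustrative)."""
    ratings = {}
    for model_a, model_b, winner in battles:
        ra = ratings.setdefault(model_a, init_rating)
        rb = ratings.setdefault(model_b, init_rating)
        expected_a = 1.0 / (1.0 + base ** ((rb - ra) / scale))
        score_a = {"model_a": 1.0, "model_b": 0.0}.get(winner, 0.5)
        delta = k * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return ratings


def bootstrap_elo(battles, num_round=200, seed=0):
    """Rate num_round resamples, each drawn with replacement from battles."""
    rng = random.Random(seed)
    return [elo(rng.choices(battles, k=len(battles))) for _ in range(num_round)]


def percentile_interval(samples, model, lower=0.025, upper=0.975):
    """Percentile interval for one model; rounds where the model is absent are skipped."""
    values = sorted(s[model] for s in samples if model in s)
    last = len(values) - 1
    return values[int(lower * last)], values[int(upper * last)]


# Synthetic data (assumption): "strong" wins 160 of 200 battles against "weak".
battles = [("strong", "weak", "model_a")] * 160 + [("strong", "weak", "model_b")] * 40

samples = bootstrap_elo(battles)
strong_low, strong_high = percentile_interval(samples, "strong")
weak_low, weak_high = percentile_interval(samples, "weak")
print(f"strong: [{strong_low:.1f}, {strong_high:.1f}]")
print(f"weak:   [{weak_low:.1f}, {weak_high:.1f}]")
```

When two models' intervals do not overlap, as in this deliberately lopsided example, the rating gap is unlikely to be an artifact of which battles happened to be sampled.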

Usage

Use this module when you need to compute model ratings from pairwise battle data. Choose compute_elo for quick, sequential computation; compute_bt for statistically rigorous maximum-likelihood ratings; and compute_bootstrap_elo when you need confidence intervals. These functions are typically called by elo_analysis.py rather than invoked directly, but they can be used independently for custom analyses.
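To make the maximum-likelihood option more concrete, here is a minimal Bradley-Terry fit using the classic minorization-maximization (MM) iteration in plain Python. This is a sketch under stated assumptions, not a description of how compute_bt is implemented: the function name bradley_terry_mm, the tuple-based battle format, counting a tie as half a win for each side, and the mapping of strengths onto an Elo-like scale are all choices made for the example. It also assumes every model has at least one win or tie, otherwise its strength collapses to zero.

```python
import math
from collections import defaultdict


def bradley_terry_mm(battles, scale=400.0, base=10.0, init_rating=1000.0,
                     max_iter=500, tol=1e-10):
    """Fit Bradley-Terry strengths with MM updates (hypothetical helper)."""
    wins = defaultdict(float)   # wins[i]: total score of model i (tie = 0.5)
    games = defaultdict(float)  # games[(i, j)]: number of battles between i and j
    for model_a, model_b, winner in battles:
        score_a = {"model_a": 1.0, "model_b": 0.0}.get(winner, 0.5)
        wins[model_a] += score_a
        wins[model_b] += 1.0 - score_a
        games[(model_a, model_b)] += 1.0
        games[(model_b, model_a)] += 1.0

    models = sorted(wins)
    strength = {m: 1.0 for m in models}
    for _ in range(max_iter):
        updated = {}
        for i in models:
            # MM update: p_i = W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(n / (strength[i] + strength[j])
                        for (x, j), n in games.items() if x == i)
            updated[i] = wins[i] / denom
        # Fix the arbitrary overall scale: geometric mean of strengths is 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        updated = {m: v / math.exp(log_mean) for m, v in updated.items()}
        change = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if change < tol:
            break
    # Map strengths onto an Elo-like scale centered on init_rating.
    return {m: init_rating + scale * math.log(s, base) for m, s in strength.items()}


battles = (
    [("alpha", "beta", "model_a")] * 3 + [("alpha", "beta", "model_b")]
    + [("beta", "gamma", "model_a")] * 3 + [("beta", "gamma", "model_b")]
    + [("alpha", "gamma", "model_a")] * 3 + [("alpha", "gamma", "model_b")]
)
bt = bradley_terry_mm(battles)
bt_reversed = bradley_terry_mm(list(reversed(battles)))
for model in sorted(bt, key=bt.get, reverse=True):
    print(f"{model}: {bt[model]:.1f}")
```

The fit only sees aggregate win counts and matchup counts, so reordering the battles leaves the ratings unchanged up to floating-point noise, in contrast to sequential Elo.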

Code Reference

Source Location

Signature

def compute_elo(
    battles: pd.DataFrame,
    K: float = 4.0,
    SCALE: float = 400.0,
    BASE: float = 10.0,
    INIT_RATING: float = 1000.0,
) -> dict[str, float]:
    """Compute sequential Elo ratings from battle outcomes.

    Args:
        battles: DataFrame with columns model_a, model_b, winner.
        K: K-factor controlling rating update magnitude.
        SCALE: Logistic scale parameter.
        BASE: Logistic base parameter.
        INIT_RATING: Starting rating for all models.

    Returns:
        Dictionary mapping model names to their Elo ratings.
    """
    ...

def compute_bt(
    battles: pd.DataFrame,
) -> dict[str, float]:
    """Compute Bradley-Terry maximum-likelihood ratings.

    Args:
        battles: DataFrame with columns model_a, model_b, winner.

    Returns:
        Dictionary mapping model names to their Bradley-Terry ratings.
    """
    ...

def compute_bootstrap_elo(
    battles: pd.DataFrame,
    num_round: int = 1000,
    K: float = 4.0,
) -> pd.DataFrame:
    """Compute bootstrapped Elo ratings for confidence intervals.

    Args:
        battles: DataFrame with columns model_a, model_b, winner.
        num_round: Number of bootstrap resampling rounds.
        K: K-factor for each Elo computation.

    Returns:
        DataFrame with shape (num_round, num_models) of per-round ratings.
    """
    ...

Import

from fastchat.serve.monitor.rating_systems import compute_bt, compute_bootstrap_elo, compute_elo

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| battles | pd.DataFrame | Yes | Battle records with columns: model_a, model_b, winner (values: "model_a", "model_b", or "tie") |
| K | float | No | Elo K-factor controlling update sensitivity (default: 4.0) |
| SCALE | float | No | Logistic scale parameter for expected score computation (default: 400.0) |
| BASE | float | No | Logistic base parameter (default: 10.0) |
| INIT_RATING | float | No | Initial rating assigned to all models (default: 1000.0) |
| num_round | int | No | Number of bootstrap resampling rounds (default: 1000) |
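Before passing data to the rating functions, it can help to check records against the input contract listed above. The sketch below is a hypothetical pre-flight check, not part of rating_systems.py; the name check_battles and the list-of-dicts record format are assumptions, and the accepted winner values are taken only from the description of the battles input.

```python
REQUIRED_COLUMNS = ("model_a", "model_b", "winner")
# Winner values as listed in the input description; real logs may contain others.
VALID_WINNERS = {"model_a", "model_b", "tie"}


def check_battles(records):
    """Return a list of human-readable problems found in battle records."""
    problems = []
    for index, record in enumerate(records):
        missing = [column for column in REQUIRED_COLUMNS if column not in record]
        if missing:
            problems.append(f"row {index}: missing {', '.join(missing)}")
            continue
        if record["winner"] not in VALID_WINNERS:
            problems.append(f"row {index}: unexpected winner {record['winner']!r}")
    return problems


good = [
    {"model_a": "alpha", "model_b": "beta", "winner": "model_a"},
    {"model_a": "beta", "model_b": "gamma", "winner": "tie"},
]
bad = [
    {"model_a": "alpha", "model_b": "beta"},                      # no winner column
    {"model_a": "alpha", "model_b": "beta", "winner": "alpha"},   # model name, not a side
]
print(check_battles(good))
print(check_battles(bad))
```

A common mistake this catches is storing the winning model's name in winner instead of the side label.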

Outputs

| Name | Type | Description |
|------|------|-------------|
| elo_ratings | dict[str, float] | Dictionary mapping each model name to its computed Elo rating |
| bt_ratings | dict[str, float] | Dictionary mapping each model name to its Bradley-Terry rating |
| bootstrap_df | pd.DataFrame | DataFrame of shape (num_round, num_models) containing per-round Elo ratings for computing confidence intervals |

Usage Examples

import pandas as pd
from fastchat.serve.monitor.clean_battle_data import clean_battle_data
from fastchat.serve.monitor.rating_systems import (
    compute_bt,
    compute_bootstrap_elo,
    compute_elo,
)

# Load cleaned battle data
battles = clean_battle_data(["logs/battles.json"])

# Compute sequential Elo ratings
elo_ratings = compute_elo(battles, K=4.0, INIT_RATING=1000.0)
for model, rating in sorted(elo_ratings.items(), key=lambda x: -x[1])[:10]:
    print(f"  {model}: {rating:.1f}")

# Compute Bradley-Terry ratings
bt_ratings = compute_bt(battles)
for model, rating in sorted(bt_ratings.items(), key=lambda x: -x[1])[:10]:
    print(f"  {model}: {rating:.1f}")

# Compute bootstrap confidence intervals
bootstrap_df = compute_bootstrap_elo(battles, num_round=1000, K=4.0)
ci_lower = bootstrap_df.quantile(0.025)
ci_upper = bootstrap_df.quantile(0.975)
median = bootstrap_df.median()

for model in median.sort_values(ascending=False).index[:10]:
    print(f"  {model}: {median[model]:.1f} [{ci_lower[model]:.1f}, {ci_upper[model]:.1f}]")
