Principle:Lm sys FastChat Elo Rating Analysis
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Elo Rating Analysis |
| Repository | lm-sys/FastChat |
| Workflow | Arena Data Analysis |
| Domains | Statistics, Model Evaluation |
| Knowledge Sources | fastchat/serve/monitor/elo_analysis.py |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This principle covers the orchestration of Elo and Bradley-Terry rating computation and visualization across model categories. While the underlying rating algorithms are defined separately, this principle addresses the higher-level analysis pipeline: aggregating cleaned battle data, computing ratings per category, generating bootstrap confidence intervals, producing leaderboard tables, and visualizing rating timelines. It serves as the analytical backbone of the arena monitoring system.
Description
Battle Data Aggregation
Before ratings can be computed, cleaned battle records must be aggregated into a structured format suitable for the rating algorithms. This involves grouping battles by category (e.g., "Coding", "Math", "Overall"), constructing model-pair battle count matrices, and computing win/loss/tie tallies. The aggregation step also determines which models have sufficient battle counts to receive reliable ratings -- models with too few battles may be excluded or flagged with high uncertainty.
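The aggregation step described above can be sketched roughly as follows. This is a minimal illustration, not the code in `elo_analysis.py`: the record fields (`model_a`, `model_b`, `winner`), the helper name `aggregate_battles`, and the `min_battles` threshold are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def aggregate_battles(battles, min_battles=3):
    """Tally pair counts and per-model win/loss/tie records from battle dicts."""
    pair_counts = Counter()
    tallies = defaultdict(lambda: {"win": 0, "loss": 0, "tie": 0})
    for b in battles:
        first, second, winner = b["model_a"], b["model_b"], b["winner"]
        # Store each pair in sorted order so (A, B) and (B, A) share one cell.
        pair_counts[tuple(sorted((first, second)))] += 1
        if winner == "model_a":
            tallies[first]["win"] += 1
            tallies[second]["loss"] += 1
        elif winner == "model_b":
            tallies[second]["win"] += 1
            tallies[first]["loss"] += 1
        else:  # assumption: any other label is treated as a tie
            tallies[first]["tie"] += 1
            tallies[second]["tie"] += 1
    totals = {m: sum(t.values()) for m, t in tallies.items()}
    # Models below the (hypothetical) threshold are flagged, not silently dropped.
    sparse = sorted(m for m, n in totals.items() if n < min_battles)
    return dict(pair_counts), dict(tallies), sparse


# Illustrative records only; the model names are placeholders.
battles = [
    {"model_a": "alpha", "model_b": "beta", "winner": "model_a"},
    {"model_a": "beta", "model_b": "alpha", "winner": "model_a"},
    {"model_a": "alpha", "model_b": "beta", "winner": "tie"},
    {"model_a": "alpha", "model_b": "gamma", "winner": "model_b"},
]
pair_counts, tallies, sparse = aggregate_battles(battles)
```

Returning the sparse models as a separate list leaves the exclude-or-flag decision to the caller, which matches the two options the paragraph mentions.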
Per-Category Rating Computation
The arena evaluates models not only overall but also across specific capability categories. For each category (e.g., reasoning, coding, creative writing, multilingual), the pipeline filters battles to those tagged with the relevant category label and computes independent ratings. This produces a multi-dimensional view of model performance, revealing that a model may rank highly overall but underperform in specific domains. The rating computation invokes the underlying Elo or Bradley-Terry algorithm on each category's battle subset.
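As a rough sketch of per-category computation, the example below filters battles by tag and runs a sequential Elo update on each subset. The `categories` field, the function names, and the constants (`k`, `scale`, `base`, `init`) are assumptions chosen for illustration; a Bradley-Terry fit could be passed in through the `rate` argument instead.

```python
from collections import defaultdict


def online_elo(battles, k=4.0, scale=400.0, base=10.0, init=1000.0):
    """Sequential Elo update over battles in the order given."""
    ratings = defaultdict(lambda: init)
    for b in battles:
        ra, rb = ratings[b["model_a"]], ratings[b["model_b"]]
        expected_a = 1.0 / (1.0 + base ** ((rb - ra) / scale))
        # Anything other than a decisive result is scored as a tie (0.5).
        score_a = {"model_a": 1.0, "model_b": 0.0}.get(b["winner"], 0.5)
        delta = k * (score_a - expected_a)
        ratings[b["model_a"]] = ra + delta
        ratings[b["model_b"]] = rb - delta
    return dict(ratings)


def ratings_by_category(battles, rate=online_elo):
    """Compute independent ratings for 'overall' and for each category tag."""
    subsets = defaultdict(list)
    for b in battles:
        subsets["overall"].append(b)
        for tag in b.get("categories", ()):
            subsets[tag].append(b)
    return {tag: rate(subset) for tag, subset in subsets.items()}


# Illustrative records only.
battles = [
    {"model_a": "alpha", "model_b": "beta", "winner": "model_a", "categories": ["coding"]},
    {"model_a": "alpha", "model_b": "beta", "winner": "model_b", "categories": ["math"]},
]
per_category = ratings_by_category(battles)
```

In this toy input, "alpha" leads in the coding subset and trails in the math subset, which is the kind of divergence the per-category view is meant to surface.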
Bootstrap Confidence Intervals
A point estimate of a model's rating says little without a measure of statistical uncertainty. The analysis pipeline uses bootstrap resampling: the battle dataset is resampled with replacement many times (typically 100 to 1000 iterations), and ratings are recomputed for each resample. The resulting distribution of rating estimates yields a confidence interval (e.g., 95% CI) for each model. When two models' intervals overlap, their relative order should be treated as uncertain, so the interval widths give users a principled measure of the leaderboard's precision.
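The resampling loop can be illustrated with a percentile bootstrap. To keep the sketch short, `win_fraction` is a hypothetical stand-in for the real rating function; any callable that maps a list of battles to a `{model: value}` dict could be passed as `rate`. The function names, the fixed seed, and the default of 200 rounds are assumptions for this example.

```python
import random
from collections import defaultdict


def win_fraction(battles):
    """Stand-in statistic: share of a model's battles it won (ties count half)."""
    score, games = defaultdict(float), defaultdict(int)
    for b in battles:
        score_a = {"model_a": 1.0, "model_b": 0.0}.get(b["winner"], 0.5)
        score[b["model_a"]] += score_a
        score[b["model_b"]] += 1.0 - score_a
        games[b["model_a"]] += 1
        games[b["model_b"]] += 1
    return {m: score[m] / games[m] for m in games}


def bootstrap_intervals(battles, rate=win_fraction, rounds=200, alpha=0.05, seed=0):
    """Percentile bootstrap: resample with replacement, re-rate, take quantiles."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(rounds):
        resample = rng.choices(battles, k=len(battles))
        # A model missing from a resample simply contributes no value that round.
        for model, value in rate(resample).items():
            samples[model].append(value)
    intervals = {}
    for model, values in samples.items():
        values.sort()
        last = len(values) - 1
        low = values[int(round((alpha / 2) * last))]
        high = values[int(round((1 - alpha / 2) * last))]
        intervals[model] = (low, high)
    return intervals


# Illustrative data: "alpha" wins 7 of 10 battles against "beta".
battles = (
    [{"model_a": "alpha", "model_b": "beta", "winner": "model_a"}] * 7
    + [{"model_a": "alpha", "model_b": "beta", "winner": "model_b"}] * 3
)
intervals = bootstrap_intervals(battles)
```

With only ten battles the intervals come out wide, which illustrates why sparsely sampled models are hard to rank.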
Rating Timeline Visualization
Model ratings evolve over time as new battles are collected and as new models enter the arena. The pipeline generates rating timeline plots that show each model's rating trajectory over time. These visualizations reveal trends such as a model's rating stabilizing after sufficient battles, seasonal fluctuations in voting patterns, or the impact of adding a new strong model that shifts the relative standings of existing models.
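One way to produce the data behind such a plot is to re-rate the cumulative battle set at a series of cutoffs. The sketch below assumes each record carries a numeric `tstamp` field; `rating_timeline` and the `net_wins` stand-in (wins minus losses, used in place of a real rating function) are hypothetical names for illustration only.

```python
from collections import defaultdict


def net_wins(battles):
    """Hypothetical stand-in for a rating function: wins minus losses per model."""
    score = defaultdict(int)
    for b in battles:
        delta = {"model_a": 1, "model_b": -1}.get(b["winner"], 0)
        score[b["model_a"]] += delta
        score[b["model_b"]] -= delta
    return dict(score)


def rating_timeline(battles, cutoffs, rate):
    """Re-rate all battles up to each cutoff; return {model: [(cutoff, value), ...]}."""
    timeline = defaultdict(list)
    for cutoff in sorted(cutoffs):
        seen = [b for b in battles if b["tstamp"] <= cutoff]
        if not seen:
            continue
        for model, value in rate(seen).items():
            timeline[model].append((cutoff, value))
    return dict(timeline)


# Illustrative records; "gamma" enters the arena at the second cutoff.
battles = [
    {"model_a": "alpha", "model_b": "beta", "winner": "model_a", "tstamp": 1},
    {"model_a": "alpha", "model_b": "gamma", "winner": "model_b", "tstamp": 2},
    {"model_a": "beta", "model_b": "gamma", "winner": "model_b", "tstamp": 3},
]
timeline = rating_timeline(battles, cutoffs=[1, 2, 3], rate=net_wins)
# timeline["gamma"] == [(2, 1), (3, 2)]
```

Each model's list of `(cutoff, value)` points can then be handed to a plotting library as one line series. A model that joins late simply has a shorter series.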
Leaderboard Table Generation
The final output of the analysis pipeline is a leaderboard table that ranks models by their computed ratings within each category. The table includes the model name, rating point estimate, confidence interval, number of battles, and win rate. This table is formatted for display in the Gradio monitoring dashboard and may also be exported as CSV or markdown for external consumption. The leaderboard serves as the primary public-facing artifact of the arena evaluation system.
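A minimal sketch of the table-building step, assuming the per-model statistics have already been computed. The row keys, the column headers, the helper name `leaderboard_markdown`, and every number in the sample rows are made up for illustration; the actual dashboard formatting may differ.

```python
def leaderboard_markdown(rows):
    """Render rows sorted by rating (descending) as a markdown table."""
    ranked = sorted(rows, key=lambda r: r["rating"], reverse=True)
    lines = [
        "| Rank | Model | Rating | 95% CI | Battles | Win rate |",
        "|---|---|---|---|---|---|",
    ]
    for rank, r in enumerate(ranked, start=1):
        ci = f"{r['ci_low']:.0f} to {r['ci_high']:.0f}"
        lines.append(
            f"| {rank} | {r['model']} | {r['rating']:.0f} | {ci} "
            f"| {r['battles']} | {r['win_rate']:.1%} |"
        )
    return "\n".join(lines)


# Placeholder values, not real arena results.
rows = [
    {"model": "beta", "rating": 998.2, "ci_low": 990.1, "ci_high": 1006.4,
     "battles": 40, "win_rate": 0.45},
    {"model": "alpha", "rating": 1012.7, "ci_low": 1003.0, "ci_high": 1021.9,
     "battles": 48, "win_rate": 0.625},
]
table = leaderboard_markdown(rows)
```

The same sorted rows could be written out with the standard `csv` module when a CSV export is needed.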
Theoretical Basis
The Elo rating system, originally developed by Arpad Elo for chess, models each pairwise contest as a Bernoulli trial in which the probability of one player defeating another is a logistic function of their rating difference. The Bradley-Terry model uses the same logistic form and supplies the maximum likelihood framework behind it: given a set of pairwise outcomes, the model finds the rating vector that maximizes the likelihood of the observed data. When applied to LLM evaluation, each "player" is a model and each "game" is an arena battle.
Bootstrap resampling (Efron, 1979) provides a non-parametric method for estimating the sampling distribution of any statistic, here the model ratings. By repeatedly resampling the battle data and recomputing ratings, the bootstrap yields confidence intervals without requiring distributional assumptions beyond those of the rating model itself. Combining Bradley-Terry estimation with bootstrap confidence intervals therefore provides both point estimates and uncertainty quantification for model rankings.
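The rating model described above can be written out explicitly. The base 10 and scale 400 used here are the conventional chess constants and are assumed for illustration; $K$ is the update step size, $S_A$ the observed score of model A, and $w_{ij}$ the number of battles in which model $i$ beat model $j$ (ties are left out of this sketch).

```latex
% Elo expected score for model A against model B
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad E_B = 1 - E_A

% Online Elo update after observing S_A \in \{0, \tfrac{1}{2}, 1\}
R_A \leftarrow R_A + K \, (S_A - E_A)

% Bradley-Terry win probability; setting p_i = 10^{R_i/400} recovers the Elo form
P(i \text{ beats } j) = \frac{p_i}{p_i + p_j} = \frac{1}{1 + 10^{(R_j - R_i)/400}}

% Log-likelihood maximized over the rating vector R
\ell(R) = \sum_{i \neq j} w_{ij} \, \log \frac{1}{1 + 10^{(R_j - R_i)/400}}
```

Because $\ell(R)$ depends only on rating differences, adding a constant to every rating leaves it unchanged, so a fit needs some anchoring convention to fix the scale's origin.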