Principle:Lm sys FastChat Pairwise Rating Computation
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Pairwise Rating Computation |
| Repository | lm-sys/FastChat |
| Workflow | Arena Data Analysis |
| Domains | Statistics, Model Evaluation |
| Knowledge Sources | fastchat/serve/monitor/rating_systems.py |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This principle defines the mathematical and algorithmic foundations for computing cardinal ratings from pairwise comparison outcomes. When users vote on which of two models produced the better response, the result is an ordinal preference. Rating computation transforms these ordinal pairwise outcomes into cardinal scores that enable global ranking of all models. The two primary methods are sequential Elo rating and Bradley-Terry maximum likelihood estimation, both augmented with bootstrap resampling for uncertainty quantification.
Description
Sequential Elo Rating
The Elo rating system processes battles sequentially, updating model ratings after each contest. Each model starts with an initial rating (typically 1000 or 1500). For a battle between model A (rating R_A) and model B (rating R_B), the expected score for model A is:
E_A = 1 / (1 + 10^((R_B - R_A) / 400))
After the battle, the actual score S_A is 1 for a win, 0 for a loss, or 0.5 for a tie. The rating is updated as:
R_A_new = R_A + K * (S_A - E_A)
The K-factor controls the sensitivity of ratings to individual battles: because S_A - E_A always lies between -1 and 1, a single battle can move a rating by at most K points. A larger K makes ratings more responsive but also more volatile; a smaller K produces more stable ratings but adapts slowly to changes. In the arena context, a moderate K-factor (e.g., 4 to 32) is typically used, with the specific value tuned based on the volume of battles and the desired convergence behavior.
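The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code in fastchat/serve/monitor/rating_systems.py; the function names, the (model_a, model_b, outcome) tuple layout, and the defaults (K = 4, initial rating 1000) are assumptions chosen for the example.

```python
from collections import defaultdict


def expected_score(r_a, r_b, scale=400.0, base=10.0):
    """Expected score of A against B: 1 / (1 + base^((R_B - R_A) / scale))."""
    return 1.0 / (1.0 + base ** ((r_b - r_a) / scale))


def sequential_elo(battles, k=4.0, init_rating=1000.0):
    """Process battles in the given order and return the final ratings.

    battles: iterable of (model_a, model_b, outcome) tuples (assumed layout),
    where outcome is "a", "b", or "tie".
    """
    ratings = defaultdict(lambda: init_rating)
    for model_a, model_b, outcome in battles:
        r_a, r_b = ratings[model_a], ratings[model_b]
        e_a = expected_score(r_a, r_b)
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[outcome]
        # Both models are updated from the pre-battle ratings.
        ratings[model_a] = r_a + k * (s_a - e_a)
        ratings[model_b] = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)
```

With K = 32 and equal starting ratings, E_A = 0.5, so a single win moves the winner up by 16 points and the loser down by 16. Because each update starts from the ratings left by earlier battles, reordering the same battles generally changes the final ratings.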
Bradley-Terry Maximum Likelihood Estimation
The Bradley-Terry model provides a principled statistical framework for pairwise comparison. It assumes that each model i has a latent strength parameter p_i > 0, and the probability that model i defeats model j is:
P(i beats j) = p_i / (p_i + p_j)
Given a dataset of observed battle outcomes, the model parameters are estimated by maximizing the log-likelihood of the observed data. Written in terms of the log-strengths log(p_i), the log-likelihood is concave, so this is a convex optimization problem that can be solved efficiently with iterative algorithms (e.g., the minorization-maximization algorithm or direct gradient-based optimization). Because the strengths are identified only up to a common multiplicative factor, one normalization (such as fixing the mean rating) is needed. The resulting parameters are converted to a rating scale (e.g., by taking rating_i = 400 * log10(p_i) + base) for interpretability. Bradley-Terry is preferred over sequential Elo when the full battle dataset is available, because MLE uses all data simultaneously and its result does not depend on the order in which battles are processed.
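As a rough sketch of the minorization-maximization approach mentioned above, the following pure-Python function repeats the update p_i = W_i / sum_j [ n_ij / (p_i + p_j) ], where W_i is the total number of wins of model i and n_ij is the number of battles between i and j. The function names, the win-matrix input format, and the geometric-mean normalization are assumptions made for this example; the repository may fit the model differently.

```python
import math


def bradley_terry_mm(wins, max_iter=1000, tol=1e-10):
    """Estimate Bradley-Terry strengths with the MM update.

    wins[i][j] is the number of times model i beat model j.
    Assumes the win graph is strongly connected (every model can be reached
    from every other through a chain of wins); otherwise the MLE does not exist.
    """
    m = len(wins)
    total_wins = [sum(row) for row in wins]
    p = [1.0] * m
    for _ in range(max_iter):
        new_p = []
        for i in range(m):
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(m)
                if j != i
            )
            new_p.append(total_wins[i] / denom)
        # Strengths are identified only up to a common factor,
        # so pin the geometric mean to 1.
        log_mean = sum(math.log(x) for x in new_p) / m
        new_p = [x / math.exp(log_mean) for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p


def to_rating(p, scale=400.0, base=1000.0):
    """Map strengths to the rating scale: scale * log10(p_i) + base."""
    return [scale * math.log10(x) + base for x in p]
```

For two models where the first wins 3 of 4 battles, the estimate satisfies p_0 / p_1 = 3, which corresponds to a rating gap of 400 * log10(3), about 191 points. The result depends only on the win counts, not on the order in which the battles occurred.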
Bootstrap Resampling for Confidence Intervals
To quantify the uncertainty in computed ratings, the system employs bootstrap resampling (Efron, 1979). The procedure is:
1. Sample the battle dataset with replacement to create a bootstrap sample of the same size.
2. Compute ratings (via Elo or Bradley-Terry) on the bootstrap sample.
3. Repeat steps 1-2 for a large number of iterations (e.g., 100 to 1000).
4. The distribution of bootstrap rating estimates yields confidence intervals for each model.
Bootstrap confidence intervals are non-parametric: they do not assume a specific distributional form (such as normality) for the rating estimates, which makes them less sensitive to model misspecification. The interval for a given model narrows as the number of battles involving that model grows, roughly in proportion to the inverse square root of that number.
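The resampling procedure above might look like the following in plain Python. It is a sketch under stated assumptions: `bootstrap_intervals` and `elo_ratings` are hypothetical names, battles are assumed to be (model_a, model_b, outcome) tuples, the stand-in rating function is a compact sequential Elo, and the interval is a simple percentile interval.

```python
import random


def elo_ratings(battles, k=4.0, init_rating=1000.0):
    """Stand-in rating function: sequential Elo over (model_a, model_b, outcome)."""
    ratings = {}
    for model_a, model_b, outcome in battles:
        r_a = ratings.get(model_a, init_rating)
        r_b = ratings.get(model_b, init_rating)
        e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[outcome]
        delta = k * (s_a - e_a)
        ratings[model_a] = r_a + delta
        ratings[model_b] = r_b - delta
    return ratings


def bootstrap_intervals(battles, rate_fn=elo_ratings, num_rounds=100,
                        alpha=0.05, seed=0):
    """Return {model: (lower, upper)} percentile intervals at level 1 - alpha."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(num_rounds):
        # Resample with replacement, keeping the original dataset size.
        resample = rng.choices(battles, k=len(battles))
        for model, rating in rate_fn(resample).items():
            samples.setdefault(model, []).append(rating)
    intervals = {}
    for model, values in samples.items():
        values.sort()
        last = len(values) - 1
        lower = values[int((alpha / 2) * last)]
        upper = values[int((1 - alpha / 2) * last)]
        intervals[model] = (lower, upper)
    return intervals
```

Any function that maps a list of battles to a dictionary of ratings can be passed as `rate_fn`, so the same loop works for a Bradley-Terry fit. A model with few battles may be absent from some resamples, in which case its interval is based on fewer bootstrap estimates.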
Tie Handling
Ties require special treatment in both Elo and Bradley-Terry frameworks. In the Elo system, a tie is treated as half a win for each model (i.e., S_A = S_B = 0.5). In the Bradley-Terry framework, ties can be handled by the Davidson extension, which introduces an additional parameter modeling the probability of a tie as a function of the closeness of two models' strengths. Alternatively, ties can be split into half-wins, which is mathematically equivalent to the Elo treatment and avoids the need for an additional model parameter.
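The two tie treatments can be contrasted with a short sketch. In Davidson's extension, a tie parameter nu >= 0 adds the term nu * sqrt(p_i * p_j) to the denominator, so ties become more likely when the two strengths are close. The function names and the weighted-record output format below are assumptions for illustration only.

```python
import math


def davidson_probs(p_i, p_j, nu):
    """Return (P(i wins), P(tie), P(j wins)) under Davidson's tie extension.

    nu = 0 recovers the plain Bradley-Terry probabilities.
    """
    tie_mass = nu * math.sqrt(p_i * p_j)
    total = p_i + p_j + tie_mass
    return p_i / total, tie_mass / total, p_j / total


def split_ties(battles):
    """Rewrite battles as (winner, loser, weight) records.

    A tie becomes half a win for each side, so no extra parameter is needed.
    battles: iterable of (model_a, model_b, outcome), outcome in "a", "b", "tie".
    """
    weighted = []
    for model_a, model_b, outcome in battles:
        if outcome == "tie":
            weighted.append((model_a, model_b, 0.5))
            weighted.append((model_b, model_a, 0.5))
        elif outcome == "a":
            weighted.append((model_a, model_b, 1.0))
        else:
            weighted.append((model_b, model_a, 1.0))
    return weighted
```

For two equally strong models and nu = 1, the Davidson probabilities of a win, a tie, and a loss are each 1/3. The half-win split keeps the model at one parameter per model, at the cost of not estimating how often ties occur.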
K-Factor Tuning
The K-factor in sequential Elo is a critical hyperparameter. In the arena system, K is tuned to balance two objectives: convergence speed (reaching stable ratings quickly for new models) and rating stability (avoiding excessive fluctuation for established models). Adaptive K-factor schemes may assign higher K values to models with fewer battles (analogous to provisional ratings in chess) and lower K values to models with extensive battle histories.
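One possible adaptive scheme is sketched below. It is purely hypothetical and is not taken from the repository: the linear decay, the names `adaptive_k` and `update_with_adaptive_k`, and the defaults (K from 32 down to 4 over 30 battles) are all assumptions chosen to illustrate the idea of provisional ratings.

```python
def adaptive_k(num_battles, k_new=32.0, k_established=4.0, provisional_battles=30):
    """Hypothetical schedule: K decays linearly from k_new to k_established."""
    if num_battles >= provisional_battles:
        return k_established
    frac = num_battles / provisional_battles
    return k_new + frac * (k_established - k_new)


def update_with_adaptive_k(ratings, counts, model_a, model_b, s_a, init_rating=1000.0):
    """Apply one Elo update in place, using a per-model K based on battle counts.

    s_a is the score of model_a: 1.0 for a win, 0.0 for a loss, 0.5 for a tie.
    """
    r_a = ratings.get(model_a, init_rating)
    r_b = ratings.get(model_b, init_rating)
    e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    k_a = adaptive_k(counts.get(model_a, 0))
    k_b = adaptive_k(counts.get(model_b, 0))
    ratings[model_a] = r_a + k_a * (s_a - e_a)
    ratings[model_b] = r_b + k_b * ((1.0 - s_a) - (1.0 - e_a))
    counts[model_a] = counts.get(model_a, 0) + 1
    counts[model_b] = counts.get(model_b, 0) + 1
```

One consequence of per-model K values is that updates are no longer zero-sum: when a new model beats an established one, the newcomer gains more points than the established model loses.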
Theoretical Basis
The Elo rating system (Elo, 1978) was developed for chess. In the logistic form used here, it models each match as a Bernoulli trial whose win probability depends only on the difference between the two players' ratings, with the logistic function linking rating differences to win probabilities.
The Bradley-Terry model (Bradley and Terry, 1952) supplies the maximum likelihood estimation (MLE) framework for fitting the same win-probability model to a whole batch of battles. Setting p_i = 10^(R_i / 400) turns P(i beats j) = p_i / (p_i + p_j) into the Elo expected score, so the two methods share one underlying model. Sequential Elo can be viewed as an online stochastic approximation to the Bradley-Terry MLE; the batch estimator is more statistically efficient because it uses all observations simultaneously and does not depend on the ordering of battles.
Bootstrap resampling (Efron, 1979) provides non-parametric confidence intervals by simulating the sampling distribution of the rating estimates through repeated resampling of the observed data. The theoretical justification rests on consistency results for the bootstrap, which show that the bootstrap distribution converges to the true sampling distribution under mild regularity conditions.
Together, these three components (Elo for online updates, Bradley-Terry for batch estimation, and bootstrap for uncertainty quantification) form a complete statistical framework for ranking models from pairwise human preferences.