Principle:Lm sys FastChat Battle Data Cleaning
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Battle Data Cleaning |
| Repository | lm-sys/FastChat |
| Workflow | Arena Data Analysis |
| Domains | Data Processing, Statistics |
| Knowledge Sources | fastchat/serve/monitor/clean_battle_data.py |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This principle addresses the systematic cleaning and validation of raw arena battle logs before they are used for statistical analysis and model rating computation. Raw battle data collected from the arena UI contains noise in the form of duplicate entries, inconsistent model naming, invalid outcomes, and bot or adversarial traffic. Without rigorous cleaning, downstream rating systems such as Elo and Bradley-Terry will produce biased or unreliable rankings. This principle defines the transformations and filters required to produce a clean, analysis-ready battle dataset.
Description
Deduplication
Users may inadvertently (or deliberately) submit duplicate battles through page refreshes, network retries, or automated scripts. Deduplication identifies and removes these duplicate entries by comparing combinations of conversation content, timestamps, and session identifiers. Exact duplicates are straightforward to detect, while near-duplicates (e.g., the same prompt with minor timestamp differences) require fuzzy matching heuristics. Failing to deduplicate inflates the apparent number of battles and biases ratings toward whichever model won the duplicated battles, because each repeated vote is counted as independent evidence.
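The deduplication idea above can be sketched as follows. This is a minimal illustration, not the logic in `clean_battle_data.py`: the record fields (`session_id`, `model_a`, `model_b`, `prompt`, `winner`, `tstamp`) and the 5-second window are assumptions chosen for the example.

```python
import hashlib


def dedup_battles(battles, window_seconds=5.0):
    """Drop exact and near-duplicate battle records.

    Assumption: two records are duplicates when they share session, model
    pair, prompt text and winner, and the later one arrives within
    `window_seconds` of the last record kept for that key.
    """
    last_kept = {}  # content key -> timestamp of the last record kept
    cleaned = []
    for b in sorted(battles, key=lambda r: r["tstamp"]):
        raw = "\x1f".join(
            [b["session_id"], b["model_a"], b["model_b"], b["prompt"], b["winner"]]
        )
        key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        prev = last_kept.get(key)
        if prev is not None and b["tstamp"] - prev <= window_seconds:
            continue  # same content shortly after a kept record: treat as a retry
        last_kept[key] = b["tstamp"]
        cleaned.append(b)
    return cleaned
```

Setting `window_seconds=0` reduces this to exact-duplicate removal (identical content and identical timestamp); a larger window trades recall of retries against the risk of discarding a genuine repeated vote.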
Model Name Normalization
Model names in raw battle logs may vary due to versioning, aliasing, or typos -- for example, gpt-4-0314 vs. gpt-4 vs. GPT4. Model name normalization maps all variant names to a canonical form using a maintained lookup table. This ensures that battles involving the same model are correctly aggregated, preventing a single model from being split across multiple rating entries.
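A lookup-table normalizer might look like the sketch below. The table contents are hypothetical and only mirror the example variants mentioned above; whether a dated version such as `gpt-4-0314` should be merged into `gpt-4` is an analysis decision, not something this sketch settles.

```python
# Hypothetical alias table: lowercase variant -> canonical name.
CANONICAL_NAMES = {
    "gpt4": "gpt-4",
    "gpt-4": "gpt-4",
    "gpt-4-0314": "gpt-4",
}


def normalize_model_name(name, table=CANONICAL_NAMES):
    """Map a raw model name to its canonical form.

    Unknown names are only trimmed and lowercased, so they stay
    distinguishable and can be added to the table later.
    """
    key = name.strip().lower()
    return table.get(key, key)


def normalize_battle(battle, table=CANONICAL_NAMES):
    """Return a copy of the battle with both model names normalized."""
    out = dict(battle)
    out["model_a"] = normalize_model_name(battle["model_a"], table)
    out["model_b"] = normalize_model_name(battle["model_b"], table)
    return out
```

For example, `normalize_model_name("GPT4")` and `normalize_model_name("gpt-4-0314")` both return `"gpt-4"` under this table.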
Anonymization
Before sharing battle data publicly or using it for reproducible research, anonymization removes or hashes personally identifiable information (PII) such as IP addresses, session tokens, and user identifiers. Anonymization is applied after deduplication (which may rely on session identifiers) but before any downstream analysis or data export.
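One way to sketch this step is to replace the IP address with a keyed hash and drop the remaining identifying fields. The field names (`ip`, `session_id`, `user_agent`, `judge`) and the choice of HMAC-SHA256 are assumptions for illustration, not a description of the repository's code.

```python
import hashlib
import hmac

# Hypothetical fields that are removed outright rather than hashed.
PII_FIELDS_TO_DROP = ("session_id", "user_agent")


def anonymize_battle(battle, secret_key):
    """Return a copy of the record with PII removed.

    The IP address is replaced by a pseudonymous `judge` id derived with
    HMAC-SHA256, so votes from one address can still be grouped without
    exposing the address. `secret_key` is bytes and must stay private.
    """
    out = {
        k: v
        for k, v in battle.items()
        if k != "ip" and k not in PII_FIELDS_TO_DROP
    }
    if "ip" in battle:
        digest = hmac.new(
            secret_key, battle["ip"].encode("utf-8"), hashlib.sha256
        ).hexdigest()
        out["judge"] = "user_" + digest[:16]
    return out
```

A keyed hash is used here because the IPv4 address space is small enough that an unkeyed hash could be reversed by enumerating every address.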
Date Range Filtering
Not all historical battles are relevant for every analysis. Date range filtering restricts the dataset to battles within a specified time window. This is important when computing current ratings (where very old battles may reflect outdated model versions), when analyzing trends over time, or when excluding periods affected by known data collection issues.
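A date window filter can be sketched in a few lines. The `tstamp` field (Unix seconds) is an assumed record layout, and the half-open interval is a design choice for the example so that adjacent windows never share a battle.

```python
from datetime import datetime, timezone


def filter_by_date(battles, start=None, end=None):
    """Keep battles with start <= tstamp < end.

    `start` and `end` are timezone-aware datetimes; passing None leaves
    that side of the window open.
    """
    start_ts = start.timestamp() if start is not None else float("-inf")
    end_ts = end.timestamp() if end is not None else float("inf")
    return [b for b in battles if start_ts <= b["tstamp"] < end_ts]


# Example window: all of May 2023 in UTC.
MAY_START = datetime(2023, 5, 1, tzinfo=timezone.utc)
MAY_END = datetime(2023, 6, 1, tzinfo=timezone.utc)
```

Calling `filter_by_date(battles, MAY_START, MAY_END)` would then keep only battles whose timestamp falls in May 2023 (UTC).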
Outcome Validation
Each battle record must have a valid outcome: model A wins, model B wins, tie, or both bad. Outcome validation checks that the recorded outcome is one of these valid values and that it is consistent with the battle metadata (e.g., a battle cannot have a winner that is not one of the two participating models). Records with missing, corrupted, or inconsistent outcomes are discarded.
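The validation rule can be illustrated with the sketch below. It assumes a raw record in which `winner` holds either the winning model's name or one of the labels `tie` / `both_bad`; those labels, the field names, and the decision to reject battles where both sides are the same model are all assumptions made for the example.

```python
# Hypothetical labels for outcomes that do not name a winner.
NON_WIN_OUTCOMES = {"tie", "both_bad"}


def resolve_outcome(battle):
    """Return 'model_a', 'model_b', 'tie' or 'both_bad'; None if invalid."""
    model_a = battle.get("model_a")
    model_b = battle.get("model_b")
    winner = battle.get("winner")
    if not model_a or not model_b or model_a == model_b:
        return None  # missing participant or degenerate pairing
    if winner in NON_WIN_OUTCOMES:
        return winner
    if winner == model_a:
        return "model_a"
    if winner == model_b:
        return "model_b"
    return None  # missing winner, or a winner that did not take part


def validate_outcomes(battles):
    """Discard invalid records and attach a canonical `outcome` field."""
    cleaned = []
    for b in battles:
        outcome = resolve_outcome(b)
        if outcome is not None:
            cleaned.append({**b, "outcome": outcome})
    return cleaned
```

Converting to positional labels at this stage means later steps only ever see the four valid outcomes.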
Handling of Ties and Invalid Conversations
Ties and "both bad" outcomes require special handling. While ties carry useful information (they indicate that two models are closely matched), an excess of ties from low-effort voting can dilute the signal. The cleaning pipeline may apply heuristics to filter out suspicious tie patterns, such as users who vote tie on every battle. Similarly, battles flagged by content moderation or those with extremely short conversations (indicating the user did not genuinely engage) are filtered as invalid.
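The heuristics described above might be combined as in the following sketch. The thresholds and field names (`judge`, `outcome`, `prompt`, `flagged`) are hypothetical, and dropping every vote from a judge who always votes tie is one possible policy rather than the pipeline's documented behavior.

```python
from collections import defaultdict


def filter_low_quality(battles, min_votes=5, tie_rate_threshold=1.0,
                       min_prompt_chars=2):
    """Drop battles from always-tie voters, flagged battles, and near-empty prompts."""
    votes = defaultdict(int)
    ties = defaultdict(int)
    for b in battles:
        votes[b["judge"]] += 1
        if b["outcome"] in ("tie", "both_bad"):
            ties[b["judge"]] += 1

    # A judge is suspicious only with enough votes to establish a pattern.
    suspicious = {
        judge
        for judge, n in votes.items()
        if n >= min_votes and ties[judge] / n >= tie_rate_threshold
    }

    kept = []
    for b in battles:
        if b["judge"] in suspicious:
            continue  # low-effort voting pattern
        if b.get("flagged", False):
            continue  # flagged by content moderation
        if len(b["prompt"].strip()) < min_prompt_chars:
            continue  # user did not genuinely engage
        kept.append(b)
    return kept
```

The `min_votes` guard matters: a single tie from an occasional voter is legitimate signal and should not be discarded.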
Theoretical Basis
Statistical estimation of model ratings from pairwise comparison data assumes that each observation is an independent, unbiased sample from the true preference distribution. Duplicate entries violate the independence assumption: a repeated battle is counted as several independent votes, which overweights that comparison and biases maximum likelihood estimates. Inconsistent model naming creates fragmented identities that reduce the effective sample size for each model and distort the comparison graph. Outcome validation ensures that the data conforms to the sample space assumed by the statistical model (e.g., the basic Bradley-Terry model assumes each observation is a win or a loss, so ties must be handled by a specific extension). Systematic data cleaning is therefore not merely a practical convenience but a statistical prerequisite for unbiased and consistent rating estimation.
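The effect of duplicates on the likelihood can be made concrete with the standard Bradley-Terry formulation. The notation here is introduced only for this illustration: $\beta_i$ is the strength parameter of model $i$, and $w_k$ and $l_k$ are the winner and loser of battle $k$ among $n$ decisive battles.

```latex
\[
P(i \succ j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}},
\qquad
\ell(\beta) = \sum_{k=1}^{n} \log P(w_k \succ l_k)
\]

% If battle r is recorded m times instead of once:
\[
\ell_{\mathrm{dup}}(\beta) = \ell(\beta) + (m - 1)\,\log P(w_r \succ l_r)
\]
```

Under this formulation a battle that appears $m$ times behaves exactly like a single observation with weight $m$, so the maximizer of $\ell_{\mathrm{dup}}$ is pulled toward the winner of the duplicated battle even though no new evidence was collected.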