Principle:Lm_sys_FastChat_Arena_Battle_UI
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Arena Battle UI |
| Repository | lm-sys/FastChat |
| Workflow | Arena Model Comparison |
| Domains | Web UI, Model Evaluation |
| Knowledge Sources | fastchat/serve/gradio_block_arena_anony.py, fastchat/serve/gradio_block_arena_named.py, fastchat/serve/gradio_block_arena_vision.py, Gradio documentation |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This principle governs the design and construction of interactive Gradio-based arena comparison interfaces for pairwise model evaluation. Arena battle UIs allow users to submit prompts, receive responses from two models side by side, and cast votes on which response is superior. The architecture supports anonymous and named model pairing modes, vision-enabled variants, content moderation, and conversation turn limits -- all of which are essential for collecting high-quality human preference data at scale.
Description
Anonymous and Named Model Pairing
The arena supports two fundamental pairing modes. In anonymous mode, the identities of the two models are hidden from the user until a vote is cast. This eliminates brand-recognition bias and ensures that judgments are based solely on response quality. In named mode, users explicitly select which two models to compare, enabling targeted evaluation of specific model pairs. Both modes share the same underlying voting infrastructure but differ in how model selection is handled at the UI layer.
Weighted Random Model Selection
In anonymous mode, model pairs are not drawn uniformly at random. Instead, a weighted random selection strategy is employed to prioritize underrepresented or newly added models. This ensures that the resulting preference dataset has adequate coverage across all models in the arena, preventing popular models from dominating the comparison matrix and leaving newer models with insufficient battle counts for reliable rating estimation.
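This selection strategy can be sketched as a two-stage weighted draw: pick the first model proportionally to its weight, then pick a distinct second model from the remainder. The model names and weight values below are illustrative, not FastChat's actual configuration:

```python
import random

# Hypothetical per-model sampling weights: newly added or underrepresented
# models get larger weights so they accumulate battles faster.
SAMPLE_WEIGHTS = {
    "model-a": 1.0,
    "model-b": 1.0,
    "new-model": 4.0,  # boosted until it has enough battle counts
}

def sample_battle_pair(weights):
    """Draw two distinct models, each proportional to its weight."""
    models = list(weights)
    first = random.choices(models, weights=[weights[m] for m in models], k=1)[0]
    rest = [m for m in models if m != first]
    second = random.choices(rest, weights=[weights[m] for m in rest], k=1)[0]
    return first, second
```

Renormalizing over the remaining models for the second draw keeps the pair distinct without rejection sampling.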
Parallel Response Streaming
To provide a responsive user experience, the arena streams responses from both models concurrently. Two independent inference requests are dispatched to the respective model workers, and their token-by-token outputs are interleaved in the UI. This parallelism ensures that users do not wait for one model to finish before seeing the other, reducing perceived latency and encouraging engagement.
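One way to realize this interleaving is to pump each model's token generator into a shared queue from its own thread, yielding chunks as they arrive so neither side blocks the other. This is a minimal sketch; FastChat's actual streaming loop runs through Gradio generator callbacks and worker HTTP streams:

```python
import queue
import threading

def interleave_streams(gen_left, gen_right):
    """Merge two token generators into one stream of (side, token) pairs
    in arrival order, preserving per-side token ordering."""
    q = queue.Queue()

    def pump(side, gen):
        for token in gen:
            q.put((side, token))
        q.put((side, None))  # sentinel: this side has finished

    for side, gen in (("left", gen_left), ("right", gen_right)):
        threading.Thread(target=pump, args=(side, gen), daemon=True).start()

    finished = 0
    while finished < 2:
        side, token = q.get()
        if token is None:
            finished += 1
        else:
            yield side, token
```

Because each thread pushes to a FIFO queue in order, per-side ordering is preserved even though the two sides arrive interleaved.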
Voting Workflows
After both models have finished responding, the user is presented with four voting options: Left is Better, Right is Better, Tie, and Both Bad. Each vote is recorded as a battle outcome along with metadata (model identities, conversation history, timestamps). The "Both Bad" option is critical for filtering out cases where neither model produces acceptable output, preventing these degenerate battles from distorting downstream rating computations.
Content Moderation Guards
Before a prompt is dispatched to the models, it passes through a content moderation pipeline that checks for policy-violating content. Moderation guards may include keyword-based filters, external moderation API calls, or heuristic rules. If a prompt is flagged, the UI displays a warning and refuses to relay the request, protecting both the service and the collected dataset from toxic or adversarial inputs.
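A minimal guard combining a keyword filter with an optional external check could look like the sketch below. The blocked terms and the `api_check` callback are placeholders, not FastChat's actual moderation pipeline:

```python
# Illustrative blocklist only; real deployments maintain curated lists
# and typically also call a hosted moderation API.
BLOCKED_TERMS = {"blocked-term-1", "blocked-term-2"}

def moderation_guard(prompt, api_check=None):
    """Return (flagged, reason) for a prompt.

    Runs the cheap keyword filter first, then an optional external
    moderation callback (e.g. a hosted moderation API client).
    """
    lowered = prompt.lower()
    for term in BLOCKED_TERMS:
        if term in lowered:
            return True, f"blocked term: {term}"
    if api_check is not None and api_check(prompt):
        return True, "flagged by moderation API"
    return False, ""
```

When `flagged` is true, the UI would display its warning and skip dispatching the request to either model worker.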
Conversation Turn Limits
To keep battles focused and manageable, the arena enforces a maximum number of conversation turns per battle session. This prevents indefinitely long multi-turn conversations that would be difficult to evaluate and would consume excessive compute. Typical limits are set between one and eight turns, depending on the deployment configuration.
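Enforcing the limit reduces to a simple gate checked before each new user turn; the cap of eight below is illustrative of the upper end of the configurable range:

```python
MAX_TURNS = 8  # illustrative deployment configuration (range is typically 1-8)

def can_continue(completed_turns, max_turns=MAX_TURNS):
    """Allow another user turn only while the battle's turn budget remains."""
    return completed_turns < max_turns
```

When the gate fails, the UI would disable the input box and prompt the user to vote or start a new battle.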
Vision-Enabled Variants
For multimodal models, the arena provides vision-enabled variants of both anonymous and named modes. These interfaces allow users to upload images alongside text prompts, enabling pairwise comparison of vision-language models. The vision variants reuse the same voting and streaming infrastructure, adding image preprocessing and display components to the Gradio layout.
Theoretical Basis
The arena battle UI is grounded in pairwise preference elicitation, a well-established methodology for ranking items from human judgments. Rather than asking humans to assign absolute scores (which are subject to calibration inconsistencies across annotators), pairwise comparison asks only which of two options is better -- a cognitively simpler and more reliable task. The resulting preference data can be aggregated using Bradley-Terry models or Elo rating systems to produce cardinal rankings. Crucially, blind comparison (anonymous mode) eliminates brand-recognition effects, while randomizing which model appears on the left or right mitigates position bias, ensuring that the collected preferences reflect genuine quality differences rather than confounding factors. The weighted random selection of model pairs further ensures that the comparison matrix is well-conditioned for downstream statistical estimation.
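For concreteness, a single Elo update from one battle outcome looks like this. It is the standard Elo rule with a logistic expected score at scale 400; the K-factor of 32 is a conventional choice, not necessarily the setting used in any particular arena deployment:

```python
def elo_update(r_winner, r_loser, k=32):
    """Apply one Elo update after a battle.

    The winner's expected score follows the logistic curve
    1 / (1 + 10^((r_loser - r_winner) / 400)); both ratings move
    by k times the surprise (1 - expected).
    """
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta
```

With equal ratings the expected score is 0.5, so each battle moves both ratings by k/2; a win by an already much stronger model moves the ratings only slightly.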
Related Pages
- Implemented by: Implementation:Lm_sys_FastChat_Build_Side_By_Side_Arena_Anony_UI
- Implemented by: Implementation:Lm_sys_FastChat_Build_Side_By_Side_Arena_Named_UI
- Implemented by: Implementation:Lm_sys_FastChat_Build_Single_Model_Vision_UI
- Implemented by: Implementation:Lm_sys_FastChat_Build_Side_By_Side_Vision_Anony_UI
- Implemented by: Implementation:Lm_sys_FastChat_Build_Side_By_Side_Vision_Named_UI
- Implemented by: Implementation:Lm_sys_FastChat_Copilot_Arena_Tab