
Principle: lm-sys FastChat Arena Battle UI

From Leeroopedia


Page Type: Principle
Title: Arena Battle UI
Repository: lm-sys/FastChat
Workflow: Arena Model Comparison
Domains: Web UI, Model Evaluation
Knowledge Sources: fastchat/serve/gradio_block_arena_anony.py, fastchat/serve/gradio_block_arena_named.py, fastchat/serve/gradio_block_arena_vision.py, Gradio documentation
Last Updated: 2026-02-07 14:00 GMT

Overview

This principle governs the design and construction of interactive Gradio-based arena comparison interfaces for pairwise model evaluation. Arena battle UIs allow users to submit prompts, receive responses from two models side by side, and cast votes on which response is superior. The architecture supports anonymous and named model pairing modes, vision-enabled variants, content moderation, and conversation turn limits -- all of which are essential for collecting high-quality human preference data at scale.

Description

Anonymous and Named Model Pairing

The arena supports two fundamental pairing modes. In anonymous mode, the identities of the two models are hidden from the user until a vote is cast. This eliminates brand-recognition bias and ensures that judgments are based solely on response quality. In named mode, users explicitly select which two models to compare, enabling targeted evaluation of specific model pairs. Both modes share the same underlying voting infrastructure but differ in how model selection is handled at the UI layer.

Weighted Random Model Selection

In anonymous mode, model pairs are not drawn uniformly at random. Instead, a weighted random selection strategy is employed to prioritize underrepresented or newly added models. This ensures that the resulting preference dataset has adequate coverage across all models in the arena, preventing popular models from dominating the comparison matrix and leaving newer models with insufficient battle counts for reliable rating estimation.
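The weighting idea can be sketched as follows. This is a minimal illustration, not FastChat's actual sampler: the helper name `pick_pair` and the inverse-battle-count weighting are assumptions, though FastChat similarly assigns per-model sampling weights in anonymous mode.

```python
import random

def pick_pair(models, battle_counts, rng=random):
    """Pick two distinct models, weighting toward those with fewer battles.

    `battle_counts` maps model name -> number of recorded battles.
    Inverse-count weights (+1 avoids division by zero for brand-new
    models) give underrepresented models a higher chance of selection.
    """
    weights = {m: 1.0 / (battle_counts.get(m, 0) + 1) for m in models}
    first = rng.choices(models, weights=[weights[m] for m in models], k=1)[0]
    rest = [m for m in models if m != first]
    second = rng.choices(rest, weights=[weights[m] for m in rest], k=1)[0]
    return first, second
```

With this scheme a model that has zero battles is sampled far more often than one with hundreds, which fills out sparse cells of the comparison matrix quickly.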

Parallel Response Streaming

To provide a responsive user experience, the arena streams responses from both models concurrently. Two independent inference requests are dispatched to the respective model workers, and their token-by-token outputs are interleaved in the UI. This parallelism ensures that users do not wait for one model to finish before seeing the other, reducing perceived latency and encouraging engagement.
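The interleaving can be sketched with a simplified, single-threaded model. The real arena dispatches two concurrent HTTP requests to model workers and refreshes both chat panes as chunks arrive; here, as an illustration only, two token iterators are advanced in lock-step.

```python
from itertools import zip_longest

def interleave_streams(stream_a, stream_b):
    """Yield (left_text, right_text) UI states as two token streams advance.

    Each yield corresponds to one UI refresh showing both panes' partial
    outputs; a stream that finishes early simply stops contributing tokens.
    """
    left, right = "", ""
    for tok_a, tok_b in zip_longest(stream_a, stream_b, fillvalue=""):
        left += tok_a
        right += tok_b
        yield left, right
```

The key property is that neither pane blocks on the other: the user sees both responses grow at once, even if one model is much slower.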

Voting Workflows

After both models have finished responding, the user is presented with four voting options: Left is Better, Right is Better, Tie, and Both Bad. Each vote is recorded as a battle outcome along with metadata (model identities, conversation history, timestamps). The "Both Bad" option is critical for filtering out cases where neither model produces acceptable output, preventing these degenerate battles from distorting downstream rating computations.
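A battle record might be persisted as a JSON line like the sketch below. The field names and the `record_vote` helper are illustrative assumptions; FastChat logs comparable metadata (vote type, both model names, full conversation state, timestamp).

```python
import json
import time

VOTE_TYPES = {"leftvote", "rightvote", "tievote", "bothbad_vote"}

def record_vote(vote_type, model_a, model_b, conversation, log_path):
    """Append one battle outcome to a JSONL log and return the entry."""
    if vote_type not in VOTE_TYPES:
        raise ValueError(f"unknown vote type: {vote_type}")
    entry = {
        "type": vote_type,          # which of the four buttons was pressed
        "models": [model_a, model_b],
        "conversation": conversation,
        "tstamp": time.time(),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Downstream rating pipelines can then filter on `type`, for example dropping `bothbad_vote` entries before fitting ratings.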

Content Moderation Guards

Before a prompt is dispatched to the models, it passes through a content moderation pipeline that checks for policy-violating content. Moderation guards may include keyword-based filters, external moderation API calls, or heuristic rules. If a prompt is flagged, the UI displays a warning and refuses to relay the request, protecting both the service and the collected dataset from toxic or adversarial inputs.
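As one example of such a guard, a keyword-based filter might look like the sketch below. The function name and `blocked_terms` list are hypothetical; production deployments typically combine this kind of check with calls to an external moderation API.

```python
def moderate_prompt(prompt, blocked_terms=("badword1", "badword2")):
    """Return (flagged, reason) for a user prompt.

    Case-insensitive substring matching against a blocklist; when a term
    matches, the caller shows a warning and skips dispatching the prompt.
    """
    lowered = prompt.lower()
    for term in blocked_terms:
        if term in lowered:
            return True, f"blocked term: {term}"
    return False, ""
```

Keyword filters are cheap but coarse; layering an ML-based moderation service behind them catches paraphrased or obfuscated violations.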

Conversation Turn Limits

To keep battles focused and manageable, the arena enforces a maximum number of conversation turns per battle session. This prevents indefinitely long multi-turn conversations that would be difficult to evaluate and would consume excessive compute. Typical limits are set between one and eight turns, depending on the deployment configuration.
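The limit check itself is simple; the sketch below counts user turns in the conversation state. The `can_continue` helper and the default limit are illustrative assumptions, not FastChat's exact constants.

```python
MAX_TURNS = 8  # illustrative default; set per deployment configuration

def can_continue(conversation, max_turns=MAX_TURNS):
    """Return True if the session may accept another user turn.

    A "turn" here is one user message; once the limit is reached, the UI
    disables the input box and prompts the user to vote or start a new
    battle.
    """
    user_turns = sum(1 for role, _ in conversation if role == "user")
    return user_turns < max_turns
```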

Vision-Enabled Variants

For multimodal models, the arena provides vision-enabled variants of both anonymous and named modes. These interfaces allow users to upload images alongside text prompts, enabling pairwise comparison of vision-language models. The vision variants reuse the same voting and streaming infrastructure, adding image preprocessing and display components to the Gradio layout.

Theoretical Basis

The arena battle UI is grounded in pairwise preference elicitation, a well-established methodology for ranking items from human judgments. Rather than asking humans to assign absolute scores (which are subject to calibration inconsistencies across annotators), pairwise comparison asks only which of two options is better -- a cognitively simpler and more reliable task. The resulting preference data can be aggregated using Bradley-Terry models or Elo rating systems to produce cardinal rankings. Crucially, blind comparison (anonymous mode) eliminates brand-recognition effects, while random assignment of models to the left and right panes mitigates position bias, ensuring that the collected preferences reflect genuine quality differences rather than confounding factors. The weighted random selection of model pairs further ensures that the comparison matrix is well-conditioned for downstream statistical estimation.
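How pairwise outcomes become cardinal ratings can be illustrated with a single online Elo update; this is a textbook sketch rather than the arena's production rating pipeline (which typically fits a Bradley-Terry model over all battles at once).

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One online Elo update from a single battle.

    `score_a` is 1.0 if model A wins, 0.0 if model B wins, 0.5 for a tie;
    "both bad" battles are typically excluded before rating computation.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new
```

For two equally rated models, a win transfers k/2 points from loser to winner, while a tie leaves both ratings unchanged.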
