Principle:Lm sys FastChat Prompt Deduplication
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Prompt Deduplication |
| Repository | lm-sys/FastChat |
| Workflow | Arena_Data_Analysis |
| Domains | Data_Processing, Statistics |
| Knowledge Sources | fastchat/serve/monitor/deduplication.py |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This principle addresses the detection and removal of duplicate or near-duplicate prompts from Arena battle datasets. Duplicate prompts arise when users submit the same query multiple times (intentionally or due to interface retries), when popular benchmark prompts are widely shared, or when automated scripts interact with the Arena. Deduplication ensures that each comparison in the dataset contributes unique, independent information to model rating computations and downstream analyses.
Description
Exact String Matching
The simplest form of deduplication identifies prompts that are character-for-character identical. Exact matching is computationally efficient and catches the most obvious duplicates: repeated submissions from the same user session, copy-pasted benchmark prompts, and retried requests. Implementation typically involves grouping identical prompts, either by sorting them or by hashing them into buckets, and retaining only the first occurrence (by timestamp) within each group.
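A minimal sketch of exact-match deduplication follows. The record layout (dictionaries with `prompt` and `tstamp` keys) and the function name `dedup_exact` are assumptions made for illustration; they are not taken from `deduplication.py`.

```python
import hashlib

def dedup_exact(battles):
    """Keep the earliest battle for each distinct prompt string.

    `battles` is assumed to be a list of dicts with hypothetical
    "prompt" (str) and "tstamp" (number) fields.
    """
    seen = set()
    kept = []
    # Sort by timestamp so that "first occurrence" means "earliest".
    for battle in sorted(battles, key=lambda b: b["tstamp"]):
        digest = hashlib.sha256(battle["prompt"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(battle)
    return kept

battles = [
    {"prompt": "What is AI?", "tstamp": 30},
    {"prompt": "What is AI?", "tstamp": 10},
    {"prompt": "what is ai?", "tstamp": 20},  # differs in case, so it is kept
]
unique = dedup_exact(battles)
```

Here the battle at `tstamp` 30 is dropped as an exact repeat of the one at `tstamp` 10, while the lowercase variant survives because exact matching is case-sensitive.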
Normalized Comparison
Many near-duplicate prompts differ only in trivial formatting: extra whitespace, inconsistent capitalization, or trailing punctuation. Normalized comparison applies a series of transformations before matching:
- Lowercasing: Converting all text to lowercase eliminates case-only variants (e.g., "Explain quantum computing" vs. "explain quantum computing").
- Whitespace normalization: Collapsing multiple spaces, tabs, and newlines into single spaces removes formatting-only differences.
- Punctuation stripping: Optionally removing trailing punctuation catches variants like "What is AI?" vs. "What is AI".
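After normalization, prompts are compared using exact matching on the normalized form. This raises recall, because formatting-only variants are now caught, while precision stays high as long as case and punctuation carry no meaning in the prompt. Prompts where they do carry meaning (for example, code snippets or case-sensitive identifiers) can be merged incorrectly, which is why the more aggressive steps are best kept optional.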
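The three transformations above can be sketched as a single function. The name `normalize_prompt` and its `strip_trailing_punct` flag are hypothetical, chosen only for this example.

```python
import re
import string

def normalize_prompt(text, strip_trailing_punct=True):
    """Return a canonical form of a prompt for duplicate matching."""
    # Lowercasing removes case-only variants.
    text = text.lower()
    # Collapse runs of spaces, tabs, and newlines into one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Optionally drop trailing ASCII punctuation such as "?" or ".".
    if strip_trailing_punct:
        text = text.rstrip(string.punctuation).rstrip()
    return text

a = normalize_prompt("What is AI?")
b = normalize_prompt("what  is\tAI")
```

Both calls yield the same normalized string, so the two prompts would be grouped as duplicates. Note that trailing-punctuation stripping is deliberately blunt: under this sketch a prompt ending in `)` or `+` would lose those characters too, which is one reason to keep the step optional.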
Hash-Based Deduplication for Scalability
For large-scale datasets (hundreds of thousands or millions of battles), comparing every prompt against every other prompt takes O(n²) string comparisons, which is computationally prohibitive. Hash-based deduplication addresses this by computing a fixed-size hash (e.g., SHA-256 or MD5) of each normalized prompt string. Duplicate detection then reduces to finding prompts whose digests are equal, which a hash table does in expected O(n) time and O(n) space. For datasets where near-duplicate detection beyond normalization is required, a locality-sensitive hashing scheme such as MinHash can be employed to identify prompts that share a high proportion of character n-grams, that is, prompts whose n-gram sets have high Jaccard similarity.
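As an illustration of the near-duplicate case, the following is a small MinHash sketch over character n-grams using only the standard library. All function names, the choice of 5-grams, and the 128 hash functions are assumptions for this example; a production pipeline would more likely use a dedicated library and add LSH banding to avoid comparing all signature pairs.

```python
import hashlib

def char_ngrams(text, n=5):
    """Return the set of character n-grams (shingles) of a string."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def _hash64(seed, shingle):
    # Seeded 64-bit hash built on BLAKE2b; each seed acts as one hash function.
    h = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    )
    return int.from_bytes(h.digest(), "little")

def minhash_signature(text, num_hashes=128, n=5):
    """For each hash function, record the minimum hash over all shingles."""
    shingles = char_ngrams(text, n)
    return [min(_hash64(seed, s) for s in shingles) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)

p1 = "explain quantum computing in simple terms"
p2 = "explain quantum computing in simple words"
p3 = "write a haiku about autumn leaves"
s1, s2, s3 = (minhash_signature(p) for p in (p1, p2, p3))
near = estimated_jaccard(s1, s2)  # high: the prompts differ only at the end
far = estimated_jaccard(s1, s3)   # low: the prompts share almost no 5-grams
```

A threshold on the estimated similarity (its value would need tuning on real data) then decides which pairs count as near-duplicates.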
Deduplication Scope and Policy
The deduplication scope must be carefully defined. Options include:
- Global deduplication: Remove all duplicate prompts across the entire dataset, regardless of which models were compared. This maximizes independence but may reduce dataset size significantly.
- Per-model-pair deduplication: Remove duplicates only within battles involving the same model pair, preserving instances where the same prompt was used to compare different model pairs.
- Temporal windowing: Remove duplicates only within a time window (e.g., same day), allowing the same prompt to appear in different analysis periods.
The choice of scope depends on the downstream use case: global deduplication is appropriate for rating computation, while per-model-pair deduplication may be preferred for per-model analysis.
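One way to express the three scopes is as different grouping keys fed to the same first-occurrence filter. This is a sketch under assumptions: the field names `prompt`, `model_a`, `model_b`, and `tstamp` (Unix seconds), the scope labels, and the UTC calendar-day window are all illustrative choices.

```python
from datetime import datetime, timezone

def dedup_key(battle, scope="global"):
    """Build the grouping key that defines what counts as a duplicate."""
    prompt = battle["prompt"]
    if scope == "global":
        return (prompt,)
    if scope == "model_pair":
        # Unordered pair: (A, B) and (B, A) are the same matchup.
        return (prompt, frozenset((battle["model_a"], battle["model_b"])))
    if scope == "daily":
        day = datetime.fromtimestamp(battle["tstamp"], tz=timezone.utc).date()
        return (prompt, day)
    raise ValueError(f"unknown scope: {scope!r}")

def dedup(battles, scope="global"):
    """Keep the earliest battle for each key under the chosen scope."""
    seen, kept = set(), []
    for battle in sorted(battles, key=lambda b: b["tstamp"]):
        key = dedup_key(battle, scope)
        if key not in seen:
            seen.add(key)
            kept.append(battle)
    return kept

battles = [
    {"prompt": "p", "model_a": "m1", "model_b": "m2", "tstamp": 0},
    {"prompt": "p", "model_a": "m2", "model_b": "m1", "tstamp": 3600},
    {"prompt": "p", "model_a": "m1", "model_b": "m3", "tstamp": 7200},
    {"prompt": "p", "model_a": "m1", "model_b": "m2", "tstamp": 90000},
]
n_global = len(dedup(battles, "global"))
n_pair = len(dedup(battles, "model_pair"))
n_daily = len(dedup(battles, "daily"))
```

With this toy input, global scope keeps one battle, per-model-pair scope keeps two (one per distinct matchup), and daily scope keeps two (one per UTC day), which shows how much the choice of scope changes the retained dataset.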
Theoretical Basis
Duplicate prompts inflate the apparent sample size without adding independent information, biasing rating estimates toward models that happen to perform well on the duplicated prompts. In the statistical framework of pairwise comparison (e.g., Bradley-Terry models or Elo rating), each observation is assumed to be an independent draw from the population of possible comparisons. Duplicated prompts violate this independence assumption: if a model consistently wins on a particular prompt that appears k times, that single prompt has k times the influence it should have on the model's rating. Deduplication restores the independence assumption by ensuring each unique prompt contributes at most once to the rating computation.
From an information-theoretic perspective, an exact repeat of an observation carries essentially no additional information about model quality beyond the first occurrence, so including it adds weight without adding evidence. Normalized comparison extends this reasoning by treating prompts that differ only in formatting as semantically identical and therefore as carrying the same information content.
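A toy calculation makes the k-fold influence concrete. The numbers below are invented for illustration, and the sketch assumes a two-model Bradley-Terry setting without ties, where the maximum-likelihood strength ratio is simply wins of A divided by wins of B. It also assumes model A wins every repeat of the duplicated prompt.

```python
import math

def elo_gap(wins_a, wins_b):
    """Rating gap on the Elo scale implied by a two-model win record."""
    # Two-model Bradley-Terry MLE (no ties): strength ratio = wins_a / wins_b.
    return 400 * math.log10(wins_a / wins_b)

# Hypothetical data: 10 unique prompts, A wins 5 and B wins 5.
wins_a_dedup, wins_b_dedup = 5, 5

# One of A's winning prompts is submitted k = 10 times in total,
# so the raw log holds 4 + k wins for A and still 5 wins for B.
k = 10
wins_a_raw, wins_b_raw = 4 + k, 5

dedup_rate = wins_a_dedup / (wins_a_dedup + wins_b_dedup)
raw_rate = wins_a_raw / (wins_a_raw + wins_b_raw)

gap_dedup = elo_gap(wins_a_dedup, wins_b_dedup)
gap_raw = elo_gap(wins_a_raw, wins_b_raw)
```

After deduplication the two models are tied (win rate 0.5, gap 0). On the raw log, A's win rate rises to 14/19, roughly 0.74, and the implied gap is roughly 179 Elo points, all of it produced by one repeated prompt.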