Implementation:Lm sys FastChat Deduplication
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Model_Evaluation |
| Last Updated | 2026-02-07 06:00 GMT |
Overview
Identifies and tags high-frequency prompts in FastChat conversation data using percentile-based cutoffs to support deduplication sampling.
Description
Deduplication is a data quality module that detects high-frequency prompts, which may indicate bot traffic, automated testing, or popular copy-pasted queries. Rather than removing duplicates outright, the module tags conversations with a dedup_tag column that downstream processes can use for stratified sampling or filtering. No rows are dropped at this stage, so the original data is preserved and the choice of deduplication policy is left to the consumer.
The module works by counting how often each unique prompt occurs across the entire conversation corpus. It then takes a configurable quantile of those per-prompt counts (computed over unique prompts, not over rows) and uses it as the threshold for "high-frequency." Prompts whose count is strictly greater than the threshold receive a dedup tag, allowing downstream consumers to either exclude them entirely or sample them at a reduced rate. Because the threshold is derived from the observed counts, it adapts to the data distribution rather than relying on a fixed frequency limit.
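As a small worked illustration of the cutoff, the sketch below builds a toy corpus (the data is made up for this example) and shows that the quantile is taken over the per-prompt counts. It assumes a "prompt" column, as in the examples on this page.

```python
import pandas as pd

# Toy corpus (illustrative data): one heavily repeated prompt, the rest rare.
prompts = ["hi"] * 50 + ["what is 2+2?"] * 3 + [f"question {i}" for i in range(97)]
df = pd.DataFrame({"prompt": prompts})

# One entry per unique prompt, holding its number of occurrences.
prompt_counts = df["prompt"].value_counts()

# Quantile over the 99 per-prompt counts (linear interpolation by default).
threshold = prompt_counts.quantile(0.99)

# Prompts whose count is strictly greater than the threshold.
high_frequency = prompt_counts[prompt_counts > threshold]

print(f"threshold: {threshold}")
print(f"high-frequency prompts: {high_frequency.index.tolist()}")
```

Here the threshold falls between 3 and 50, so only "hi" is treated as high-frequency. Note that when almost all prompts are unique, the threshold can sit close to 1, so the choice of cutoff matters.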
The deduplication script is designed to be run as a standalone process that reads conversation data, computes frequency statistics, applies tags, and writes the annotated output. It integrates with the broader dataset release pipeline and is used by both the Arena 33K and Chat 1M filter modules to identify TOO_FREQUENT entries.
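The read, count, tag, write flow described above could be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `run_dedup_tagging` and its parameters are hypothetical, and it assumes JSONL input with a "prompt" field.

```python
import pandas as pd

def run_dedup_tagging(input_path, output_path, percentile_cutoff=0.99):
    """Hypothetical helper: read JSONL, tag high-frequency prompts, write JSONL."""
    # Read one JSON record per line.
    df = pd.read_json(input_path, lines=True)

    # Frequency of each unique prompt and the quantile-based threshold.
    prompt_counts = df["prompt"].value_counts()
    threshold = prompt_counts.quantile(percentile_cutoff)
    print(f"Unique prompts: {len(prompt_counts)}, threshold: {threshold}")

    # Tag rows whose prompt count exceeds the threshold.
    df["dedup_tag"] = df["prompt"].map(prompt_counts) > threshold

    # Write the annotated records back out, one JSON object per line.
    df.to_json(output_path, orient="records", lines=True, force_ascii=False)
    return df
```

Keeping the tagging step separate from any filtering means the output file still contains every input record.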
Usage
Use this module as a preprocessing step before dataset release or statistical analysis when you suspect the data contains high-frequency duplicate prompts. It is especially important for arena data where popular benchmark questions or automated scripts can create skewed distributions.
Code Reference
Source Location
- Repository: Lm_sys_FastChat
- File: fastchat/serve/monitor/deduplication.py
- Lines: 1-85
Signature
# Main script execution
# Reads conversation data, computes prompt frequency,
# applies percentile-based cutoff, and adds dedup_tag column.
#
# Key operations:
# prompt_counts = df["prompt"].value_counts()
# threshold = prompt_counts.quantile(percentile_cutoff)
# df["dedup_tag"] = df["prompt"].map(lambda p: prompt_counts[p] > threshold)
Import
# Primarily used as a standalone script:
# python -m fastchat.serve.monitor.deduplication --input data.jsonl --output tagged.jsonl
# Core logic can also be adapted inline:
import pandas as pd
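def tag_high_frequency_prompts(df, percentile_cutoff=0.99):
    """Add a boolean dedup_tag column to df (in place) marking high-frequency prompts."""
    # Occurrences of each unique prompt (missing prompts are not counted).
    prompt_counts = df["prompt"].value_counts()
    # Count value at the requested quantile of the per-prompt distribution.
    threshold = prompt_counts.quantile(percentile_cutoff)
    # Map each row to its prompt's count; rows with a missing prompt compare as False.
    df["dedup_tag"] = df["prompt"].map(prompt_counts) > threshold
    return df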
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_file | str | Yes | Path to a JSONL file containing conversation records with a "prompt" field |
| percentile_cutoff | float | No | Quantile in [0, 1] of the per-prompt count distribution; prompts whose count is strictly greater than the value at this quantile are tagged as high-frequency (default: 0.99) |
Outputs
| Name | Type | Description |
|---|---|---|
| tagged_data | JSONL file | The input data augmented with a dedup_tag boolean column indicating high-frequency prompts |
| frequency_stats | stdout | Summary statistics printed to stdout showing the distribution of prompt frequencies and the computed threshold |
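The exact format of the printed statistics is not specified here. As a rough sketch of the kind of summary that could be produced, the hypothetical helper below (`summarize_prompt_frequencies` is an illustrative name, not part of FastChat) collects a few figures from the same counts and threshold.

```python
import pandas as pd

def summarize_prompt_frequencies(df, percentile_cutoff=0.99):
    """Hypothetical summary; the script's real stdout format may differ."""
    prompt_counts = df["prompt"].value_counts()
    threshold = prompt_counts.quantile(percentile_cutoff)
    return {
        "unique_prompts": int(len(prompt_counts)),
        "max_count": int(prompt_counts.max()),
        "median_count": float(prompt_counts.median()),
        "threshold": float(threshold),
        "high_frequency_prompts": int((prompt_counts > threshold).sum()),
    }

# Illustrative data only.
example = pd.DataFrame({"prompt": ["a", "a", "a", "b", "c"]})
stats = summarize_prompt_frequencies(example, percentile_cutoff=0.5)
for name, value in stats.items():
    print(f"{name}: {value}")
```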
Usage Examples
import pandas as pd
# Load conversation data
df = pd.read_json("conversations.jsonl", lines=True)
# Compute prompt frequencies
prompt_counts = df["prompt"].value_counts()
print(f"Unique prompts: {len(prompt_counts)}")
print(f"Most common prompt appears {prompt_counts.iloc[0]} times")
# Apply percentile-based dedup tagging
percentile_cutoff = 0.99
threshold = prompt_counts.quantile(percentile_cutoff)
df["dedup_tag"] = df["prompt"].map(prompt_counts) > threshold
tagged_count = df["dedup_tag"].sum()
print(f"Tagged {tagged_count} conversations as high-frequency ({tagged_count/len(df)*100:.1f}%)")
# Filter or sample for downstream use
clean_df = df[~df["dedup_tag"]]
print(f"Retained {len(clean_df)} conversations after dedup filtering")
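Dropping every tagged row is only one option. As an alternative, tagged prompts can be sampled at a reduced rate. The sketch below is one possible approach under stated assumptions: `downsample_tagged`, `max_per_prompt`, and `seed` are hypothetical names, and dedup_tag is assumed to be a boolean column.

```python
import pandas as pd

def downsample_tagged(df, max_per_prompt=1, seed=0):
    """Keep all untagged rows and at most max_per_prompt rows per tagged prompt."""
    tagged = df[df["dedup_tag"]]
    untagged = df[~df["dedup_tag"]]
    if tagged.empty:
        return df.copy()
    # Shuffle the tagged rows, then keep the first max_per_prompt rows of each prompt.
    sampled = (
        tagged.sample(frac=1, random_state=seed)
        .groupby("prompt", sort=False)
        .head(max_per_prompt)
    )
    # Restore the original row order.
    return pd.concat([untagged, sampled]).sort_index()

# Illustrative data only.
example = pd.DataFrame({
    "prompt": ["a", "a", "a", "a", "b", "c"],
    "dedup_tag": [True, True, True, True, False, False],
})
reduced = downsample_tagged(example, max_per_prompt=1)
print(reduced)
```

This keeps each popular prompt represented in the output while limiting how much it can skew downstream statistics.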
Related Pages
- Implements: Principle:Lm_sys_FastChat_Prompt_Deduplication
- Lm_sys_FastChat_Clean_Chat_Data - Upstream data cleaning before deduplication
- Lm_sys_FastChat_Filter_Bad_Conv_Arena33k - Uses dedup tags to classify TOO_FREQUENT conversations
- Lm_sys_FastChat_Filter_Bad_Conv_Chat1M - Uses dedup tags in the Chat 1M release pipeline
- Lm_sys_FastChat_Basic_Stats - Basic statistics that benefit from deduplicated data