Implementation:Lm sys FastChat Deduplication

Knowledge Sources	Lm_sys_FastChat
Domains	Data_Processing, Model_Evaluation
Last Updated	2026-02-07 06:00 GMT

Overview

Identifies and tags high-frequency prompts in FastChat conversation data using percentile-based cutoffs to support deduplication sampling.

Description

Deduplication is a data quality module that detects high-frequency prompts, which may indicate bot traffic, automated testing, or popular copy-pasted queries. Rather than removing duplicates outright, the module adds a dedup_tag column that downstream processes can use for stratified sampling or filtering. All original records are preserved; the module only adds the metadata needed to deduplicate later.

The module counts how often each unique prompt occurs across the entire conversation corpus. It then takes a configurable percentile of those per-prompt counts as the cutoff: prompts whose count exceeds that value are tagged as "high-frequency," so downstream consumers can either exclude them entirely or sample them at a reduced rate. Because the cutoff is derived from the data, it adapts to the frequency distribution rather than relying on a fixed count threshold.
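The cutoff described above can be traced by hand on a toy corpus. This is a minimal sketch, assuming a pandas DataFrame with a "prompt" column; the data and the 0.75 cutoff are illustrative only and are not taken from the script.

```python
import pandas as pd

# Toy corpus: "hi" appears 6 times, "hello" twice, three prompts once each.
prompts = ["hi"] * 6 + ["hello"] * 2 + ["a", "b", "c"]
df = pd.DataFrame({"prompt": prompts})

# One count per unique prompt: hi=6, hello=2, a=1, b=1, c=1.
prompt_counts = df["prompt"].value_counts()

# The quantile is taken over the five per-prompt counts,
# not over the eleven conversation rows.
percentile_cutoff = 0.75  # illustrative value, not the module default
threshold = prompt_counts.quantile(percentile_cutoff)

# Prompts whose count is strictly greater than the threshold are high-frequency.
high_frequency_prompts = prompt_counts[prompt_counts > threshold].index.tolist()

print(threshold)               # 2.0
print(high_frequency_prompts)  # ['hi']
```

With sorted counts [1, 1, 1, 2, 6], the 0.75 quantile falls exactly on the fourth value (2), so only "hi" exceeds it. Note that "hello", whose count equals the threshold, is not tagged because the comparison is strict.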

The deduplication script is designed to be run as a standalone process that reads conversation data, computes frequency statistics, applies tags, and writes the annotated output. It integrates with the broader dataset release pipeline and is used by both the Arena 33K and Chat 1M filter modules to identify TOO_FREQUENT entries.

Usage

Use this module as a preprocessing step before dataset release or statistical analysis when you suspect the data contains high-frequency duplicate prompts. It is especially important for arena data where popular benchmark questions or automated scripts can create skewed distributions.
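One quick way to check whether a dataset is skewed enough to warrant this step is to measure how much of the corpus the most frequent prompts account for. The helper below is a hypothetical diagnostic, not part of the FastChat script; the function name and the parameter k are assumptions made for illustration.

```python
import pandas as pd


def top_prompt_share(df, k=20):
    """Fraction of rows whose prompt is among the k most frequent prompts."""
    if len(df) == 0:
        return 0.0
    # value_counts() sorts by frequency, so head(k) is the k most common prompts.
    prompt_counts = df["prompt"].value_counts()
    return float(prompt_counts.head(k).sum() / len(df))


# Toy data: one prompt dominates the corpus.
df = pd.DataFrame({"prompt": ["hi"] * 6 + ["hello"] * 2 + ["a", "b", "c"]})
share = top_prompt_share(df, k=1)
print(f"Top-1 prompt covers {share:.0%} of conversations")
```

A large share concentrated in a handful of prompts suggests that tagging is worthwhile before computing statistics on the data.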

Code Reference

Source Location

Repository: Lm_sys_FastChat
File: fastchat/serve/monitor/deduplication.py
Lines: 1-85

Signature

# Main script execution
# Reads conversation data, computes prompt frequency,
# applies percentile-based cutoff, and adds dedup_tag column.
#
# Key operations (percentile_cutoff is a quantile in [0, 1]):
#   prompt_counts = df["prompt"].value_counts()
#   threshold = prompt_counts.quantile(percentile_cutoff)
#   df["dedup_tag"] = df["prompt"].map(prompt_counts) > threshold

Import

# Primarily used as a standalone script:
# python -m fastchat.serve.monitor.deduplication --input data.jsonl --output tagged.jsonl

# Core logic can also be adapted inline:
import pandas as pd

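def tag_high_frequency_prompts(df, percentile_cutoff=0.99):
    # Occurrences of each unique prompt (missing prompts are not counted)
    prompt_counts = df["prompt"].value_counts()
    # Count at the given quantile of the per-prompt count distribution
    threshold = prompt_counts.quantile(percentile_cutoff)
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    # Vectorized lookup; rows with a missing prompt map to NaN and compare as False
    df["dedup_tag"] = df["prompt"].map(prompt_counts) > threshold
    return df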

I/O Contract

Inputs

Name	Type	Required	Description
input_file	str	Yes	Path to a JSONL file containing conversation records with a "prompt" field
percentile_cutoff	float	No	Quantile in [0, 1] of the per-prompt count distribution; prompts whose count exceeds the value at this quantile are tagged as high-frequency (default: 0.99)

Outputs

Name	Type	Description
tagged_data	JSONL file	The input data augmented with a dedup_tag boolean column indicating high-frequency prompts
frequency_stats	stdout	Summary statistics printed to stdout showing the distribution of prompt frequencies and the computed threshold
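The contract above can be sketched end to end as a read, tag, and write round trip. This is an illustration under stated assumptions rather than the script's actual code: the helper name tag_jsonl, the file names, and the exact statistics printed are hypothetical, and the toy data uses a 0.75 cutoff so that the result is easy to verify.

```python
import os
import tempfile

import pandas as pd


def tag_jsonl(input_file, output_file, percentile_cutoff=0.99):
    """Hypothetical helper: read JSONL, add dedup_tag, write JSONL."""
    df = pd.read_json(input_file, lines=True)
    prompt_counts = df["prompt"].value_counts()
    threshold = prompt_counts.quantile(percentile_cutoff)

    # Summary statistics go to stdout, as in the outputs table.
    print(prompt_counts.describe())
    print(f"Threshold at quantile {percentile_cutoff}: {threshold}")

    # Tag every row whose prompt count exceeds the threshold.
    df["dedup_tag"] = df["prompt"].map(prompt_counts) > threshold
    df.to_json(output_file, orient="records", lines=True, force_ascii=False)
    return df


# Toy round trip in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "conversations.jsonl")
    dst = os.path.join(tmp, "tagged.jsonl")
    toy = pd.DataFrame({"prompt": ["hi"] * 6 + ["hello"] * 2 + ["a", "b", "c"]})
    toy.to_json(src, orient="records", lines=True)

    tagged = tag_jsonl(src, dst, percentile_cutoff=0.75)
    reloaded = pd.read_json(dst, lines=True)
```

Every input record appears in the output; the only change is the added dedup_tag boolean.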

Usage Examples

import pandas as pd

# Load conversation data
df = pd.read_json("conversations.jsonl", lines=True)

# Compute prompt frequencies
prompt_counts = df["prompt"].value_counts()
print(f"Unique prompts: {len(prompt_counts)}")
print(f"Most common prompt appears {prompt_counts.iloc[0]} times")

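# Apply percentile-based dedup tagging.
# The quantile is computed over per-prompt counts, one value per unique prompt.
percentile_cutoff = 0.99
threshold = prompt_counts.quantile(percentile_cutoff)
# Vectorized lookup of each row's prompt count; rows with a missing prompt stay untagged
df["dedup_tag"] = df["prompt"].map(prompt_counts) > threshold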

tagged_count = df["dedup_tag"].sum()
print(f"Tagged {tagged_count} conversations as high-frequency ({tagged_count/len(df)*100:.1f}%)")

# Filter or sample for downstream use
clean_df = df[~df["dedup_tag"]]
print(f"Retained {len(clean_df)} conversations after dedup filtering")
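Dropping every tagged row removes popular prompts from the data altogether. An alternative is to keep a capped number of rows per tagged prompt, which corresponds to the "sample at a reduced rate" option mentioned in the description. The sketch below assumes a boolean dedup_tag column as produced above; the function name, max_per_prompt, and seed are hypothetical choices for illustration.

```python
import pandas as pd


def downsample_high_frequency(df, max_per_prompt=2, seed=42):
    """Keep all untagged rows and at most max_per_prompt rows per tagged prompt."""
    tagged = df[df["dedup_tag"]]
    untagged = df[~df["dedup_tag"]]
    # Shuffle the tagged rows once, then keep the first max_per_prompt
    # rows of each prompt group (a random sample per prompt).
    sampled = (
        tagged.sample(frac=1.0, random_state=seed)
        .groupby("prompt", sort=False)
        .head(max_per_prompt)
    )
    # Restore the original row order.
    return pd.concat([untagged, sampled]).sort_index()


toy = pd.DataFrame(
    {
        "prompt": ["hi"] * 5 + ["a", "b"],
        "dedup_tag": [True] * 5 + [False] * 2,
    }
)
reduced = downsample_high_frequency(toy, max_per_prompt=2)
print(reduced["prompt"].value_counts())
```

Fixing the seed keeps the sample reproducible across runs, which matters when the output feeds a dataset release.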

Related Pages

Implements: Principle:Lm_sys_FastChat_Prompt_Deduplication
Lm_sys_FastChat_Clean_Chat_Data - Upstream data cleaning before deduplication
Lm_sys_FastChat_Filter_Bad_Conv_Arena33k - Uses dedup tags to classify TOO_FREQUENT conversations
Lm_sys_FastChat_Filter_Bad_Conv_Chat1M - Uses dedup tags in the Chat 1M release pipeline
Lm_sys_FastChat_Basic_Stats - Basic statistics that benefit from deduplicated data
