Implementation:Lm sys FastChat Deduplication

From Leeroopedia


Knowledge Sources
Domains Data_Processing, Model_Evaluation
Last Updated 2026-02-07 06:00 GMT

Overview

Identifies and tags high-frequency prompts in FastChat conversation data using percentile-based cutoffs to support deduplication sampling.

Description

Deduplication is a data quality module that detects high-frequency prompts, which may indicate bot traffic, automated testing, or popular copy-pasted queries. Rather than removing duplicates outright, the module adds a dedup_tag column that downstream processes can use for stratified sampling or filtering. This preserves all original data while recording which conversations are candidates for removal or downsampling.
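One way a downstream consumer might use the tag is to keep every untagged conversation and cap each tagged prompt at a fixed number of rows. The sketch below assumes a DataFrame with a "prompt" column and a boolean "dedup_tag" column; the helper name sample_with_dedup_tag and its max_per_prompt parameter are hypothetical and not part of FastChat.

```python
import pandas as pd

def sample_with_dedup_tag(df, max_per_prompt=1, seed=0):
    """Keep all untagged rows and at most max_per_prompt rows per tagged prompt.

    Hypothetical helper for illustration; not part of FastChat.
    """
    untagged = df[~df["dedup_tag"]]
    tagged = df[df["dedup_tag"]]
    # Shuffle the tagged rows, then keep the first max_per_prompt rows of each prompt
    capped = (
        tagged.sample(frac=1.0, random_state=seed)
        .groupby("prompt")
        .head(max_per_prompt)
    )
    # Recombine and restore the original row order
    return pd.concat([untagged, capped]).sort_index()

# Made-up data: one prompt repeated four times and already tagged
toy = pd.DataFrame(
    {
        "prompt": ["hi", "hi", "hi", "hi", "explain quicksort", "what is RLHF?"],
        "dedup_tag": [True, True, True, True, False, False],
    }
)
sampled = sample_with_dedup_tag(toy, max_per_prompt=2)
print(sampled["prompt"].value_counts())
```

Capping instead of dropping keeps a few examples of each popular prompt in the sample, which may matter when the frequent prompts are legitimate user queries and not bot traffic.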

The module computes how often each unique prompt occurs across the entire conversation corpus. It then takes a configurable percentile of those per-prompt counts as the cutoff for "high-frequency." Prompts whose count exceeds the cutoff receive a dedup tag, allowing downstream consumers to either exclude them entirely or sample them at a reduced rate. Because the cutoff is derived from the observed distribution, it adapts to the data and does not depend on a fixed frequency threshold.
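A small worked example may make the cutoff concrete. The toy corpus below is invented for illustration; it only assumes the "prompt" column and the value_counts/quantile steps described on this page.

```python
import pandas as pd

# Toy corpus: one prompt repeated 50 times, plus 99 prompts seen once each
prompts = ["hello"] * 50 + [f"question {i}" for i in range(99)]
df = pd.DataFrame({"prompt": prompts})

# One entry per unique prompt (100 entries), holding that prompt's occurrence count
prompt_counts = df["prompt"].value_counts()

# Count value at the 99th percentile of the per-prompt counts
threshold = prompt_counts.quantile(0.99)

# Prompts whose count is strictly greater than the cutoff
high_freq = prompt_counts[prompt_counts > threshold]

print(f"threshold = {threshold:.2f}")
print(high_freq)
```

Here only "hello" exceeds the cutoff. Note that the comparison is strict: in a corpus where every prompt occurs exactly once, the cutoff equals 1 and no prompt is tagged.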

The deduplication script runs as a standalone process that reads conversation data, computes frequency statistics, applies tags, and writes the annotated output. It integrates with the broader dataset release pipeline, and both the Arena 33K and Chat 1M filter modules use its tags to identify TOO_FREQUENT entries.
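The read, count, tag, and write steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the FastChat script itself: it assumes JSONL records with a "prompt" field, and the function name tag_file and the demo file names are hypothetical.

```python
import os
import tempfile

import pandas as pd

def tag_file(input_path, output_path, percentile_cutoff=0.99):
    """Read JSONL, tag high-frequency prompts, write JSONL (hypothetical helper)."""
    df = pd.read_json(input_path, lines=True)
    prompt_counts = df["prompt"].value_counts()
    threshold = prompt_counts.quantile(percentile_cutoff)
    print(f"conversations={len(df)} unique_prompts={len(prompt_counts)} threshold={threshold}")
    # Look up each row's prompt count and compare it with the cutoff
    df["dedup_tag"] = df["prompt"].map(prompt_counts) > threshold
    df.to_json(output_path, orient="records", lines=True)
    return df

# Demo on a temporary directory with made-up records
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "data.jsonl")
dst = os.path.join(workdir, "tagged.jsonl")
records = [{"prompt": "hi"}] * 20 + [{"prompt": f"q{i}"} for i in range(30)]
pd.DataFrame(records).to_json(src, orient="records", lines=True)

tagged = tag_file(src, dst, percentile_cutoff=0.95)
reloaded = pd.read_json(dst, lines=True)
print(reloaded["dedup_tag"].value_counts())
```

Every input row is written back out, so the output file has the same number of records as the input plus the extra dedup_tag field.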

Usage

Use this module as a preprocessing step before dataset release or statistical analysis when you suspect the data contains high-frequency duplicate prompts. It is especially important for arena data where popular benchmark questions or automated scripts can create skewed distributions.

Code Reference

Source Location

Signature

# Main script execution
# Reads conversation data, computes prompt frequency,
# applies percentile-based cutoff, and adds dedup_tag column.
#
# Key operations:
#   prompt_counts = df["prompt"].value_counts()
#   threshold = prompt_counts.quantile(percentile_cutoff)
#   df["dedup_tag"] = df["prompt"].map(prompt_counts) > threshold

Import

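# Primarily used as a standalone script.
# Argument names are as documented here; run the module with --help to confirm them for your FastChat version.
# python -m fastchat.serve.monitor.deduplication --input data.jsonl --output tagged.jsonl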

# Core logic can also be adapted inline:
import pandas as pd

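def tag_high_frequency_prompts(df, percentile_cutoff=0.99):
    """Add a boolean dedup_tag column to df (modified in place) and return it."""
    # Occurrence count of each unique prompt; missing prompts are not counted
    prompt_counts = df["prompt"].value_counts()
    # Count value at the requested quantile of the per-prompt counts
    threshold = prompt_counts.quantile(percentile_cutoff)
    # Vectorized lookup; rows with a missing prompt map to NaN and compare as False
    df["dedup_tag"] = df["prompt"].map(prompt_counts) > threshold
    return df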

I/O Contract

Inputs

Name Type Required Description
input_file str Yes Path to a JSONL file containing conversation records with a "prompt" field
percentile_cutoff float No Quantile in [0, 1] of the per-prompt frequency distribution; prompts whose count exceeds the value at this quantile are tagged as high-frequency (default: 0.99)

Outputs

Name Type Description
tagged_data JSONL file The input data augmented with a dedup_tag boolean column indicating high-frequency prompts
frequency_stats stdout Summary statistics printed to stdout showing the distribution of prompt frequencies and the computed threshold
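The exact statistics the script prints are not listed on this page. As a rough sketch of what such a summary could contain, the hypothetical helper below (summarize_prompt_frequencies, not part of FastChat) reports the corpus size, the number of unique prompts, the largest and median counts, the computed threshold, and how many prompts exceed it.

```python
import pandas as pd

def summarize_prompt_frequencies(df, percentile_cutoff=0.99):
    """Print and return basic prompt-frequency statistics (hypothetical helper)."""
    prompt_counts = df["prompt"].value_counts()
    threshold = prompt_counts.quantile(percentile_cutoff)
    stats = {
        "conversations": len(df),
        "unique_prompts": len(prompt_counts),
        "max_count": int(prompt_counts.max()),
        "median_count": float(prompt_counts.median()),
        "threshold": float(threshold),
        "high_frequency_prompts": int((prompt_counts > threshold).sum()),
    }
    for name, value in stats.items():
        print(f"{name}: {value}")
    return stats

# Made-up data: "a" appears 10 times, "b" 3 times, four prompts once each
demo = pd.DataFrame({"prompt": ["a"] * 10 + ["b"] * 3 + ["c", "d", "e", "f"]})
stats = summarize_prompt_frequencies(demo, percentile_cutoff=0.9)
```

Returning the statistics as a dictionary as well as printing them makes the summary easy to log or assert on in a pipeline test.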

Usage Examples

import pandas as pd

# Load conversation data
df = pd.read_json("conversations.jsonl", lines=True)

# Compute prompt frequencies
prompt_counts = df["prompt"].value_counts()
print(f"Unique prompts: {len(prompt_counts)}")
print(f"Most common prompt appears {prompt_counts.iloc[0]} times")

# Apply percentile-based dedup tagging
percentile_cutoff = 0.99
threshold = prompt_counts.quantile(percentile_cutoff)
# Vectorized lookup of each row's prompt count; rows with a missing prompt compare as False
df["dedup_tag"] = df["prompt"].map(prompt_counts) > threshold

tagged_count = df["dedup_tag"].sum()
print(f"Tagged {tagged_count} conversations as high-frequency ({tagged_count/len(df)*100:.1f}%)")

# Filter or sample for downstream use
clean_df = df[~df["dedup_tag"]]
print(f"Retained {len(clean_df)} conversations after dedup filtering")
