Principle:Lm sys FastChat Chat Data Cleaning

Field	Value
Page Type	Principle
Title	Chat Data Cleaning
Repository	lm-sys/FastChat
Workflow	Dataset_Release
Domains	Data_Processing, Privacy
Knowledge Sources	fastchat/serve/monitor/clean_chat_data.py
Last Updated	2026-02-07 14:00 GMT

Overview

This principle defines the requirements and methods for cleaning raw chat conversation logs collected from the Chatbot Arena and related serving infrastructure. Cleaning transforms noisy, heterogeneous log data into well-structured, privacy-compliant datasets suitable for statistical analysis, model evaluation, and public dataset release. The cleaning pipeline must be deterministic and idempotent: repeated execution on the same input produces identical output, and re-running the pipeline on already cleaned data changes nothing.

Description

Content Filtering

Raw chat logs may contain conversations that are inappropriate for release or analysis, including conversations with toxic content, personally identifiable information embedded in prompts, or system-level debug messages. Content filtering applies a series of rule-based and API-based checks to flag or remove such conversations. Conversations that fail any filter are excluded from the cleaned output, with optional logging of rejection reasons for pipeline auditing.
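
The rule-based part of this step can be sketched as a list of small predicate functions, each returning a rejection reason or nothing. This is a minimal illustration, not the code in clean_chat_data.py: the field names (conversation_id, conversation, content), the "[DEBUG]" marker, and the placeholder blocklist are all assumptions made for the example, and an API-based check would slot in as one more rule.

```python
from typing import Callable, List, Optional, Tuple

# A rule inspects one conversation and returns a rejection reason,
# or None when the conversation passes. (Hypothetical interface.)
Rule = Callable[[dict], Optional[str]]


def has_debug_message(conv: dict) -> Optional[str]:
    # "[DEBUG]" is an assumed marker for system-level debug output.
    if any(t["content"].startswith("[DEBUG]") for t in conv["conversation"]):
        return "debug_message"
    return None


def has_blocked_term(conv: dict) -> Optional[str]:
    # Placeholder blocklist; a moderation API call could replace this rule.
    blocked = ("blockedterm",)
    text = " ".join(t["content"].lower() for t in conv["conversation"])
    return "blocked_term" if any(b in text for b in blocked) else None


def filter_conversations(
    convs: List[dict], rules: List[Rule]
) -> Tuple[List[dict], List[Tuple[str, List[str]]]]:
    """Keep conversations that pass every rule; log reasons for the rest."""
    kept, rejected = [], []
    for conv in convs:
        reasons = [r for r in (rule(conv) for rule in rules) if r is not None]
        if reasons:
            rejected.append((conv["conversation_id"], reasons))
        else:
            kept.append(conv)
    return kept, rejected
```

Returning the rejected list alongside the kept list is one way to support the optional audit log: the caller decides whether to write it out or discard it.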

PII Removal

Personally identifiable information -- including email addresses, phone numbers, physical addresses, and names -- must be detected and removed before any dataset release. PII detection combines regex-based pattern matching (for structured PII like emails and phone numbers) with heuristic approaches for less structured information. When PII is detected, the affected conversation is either redacted (PII tokens replaced with placeholder markers) or excluded entirely, depending on the severity and density of PII occurrences.
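
As a rough sketch of the regex-based part, the snippet below redacts emails and phone numbers and excludes text whose PII count exceeds a threshold. The patterns, the [EMAIL] and [PHONE] placeholders, and the max_hits threshold are illustrative assumptions rather than the values used by the actual pipeline; the phone pattern in particular is deliberately loose and will also match other long digit runs.

```python
import re
from typing import Optional, Tuple

# Simplified patterns for structured PII (assumed, not exhaustive).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> Tuple[str, int]:
    """Replace matches with placeholder markers and count replacements."""
    text, n_email = EMAIL_RE.subn("[EMAIL]", text)
    text, n_phone = PHONE_RE.subn("[PHONE]", text)
    return text, n_email + n_phone


def clean_text(text: str, max_hits: int = 3) -> Optional[str]:
    """Redact sparse PII; return None (exclude) when PII is too dense."""
    redacted, hits = redact_pii(text)
    return None if hits > max_hits else redacted
```

Emails are replaced before phone numbers so that digits inside an email address are not mistaken for part of a phone number.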

Model Name Normalization

During Arena operation, model names may appear in various formats due to versioning, aliasing, or configuration changes (e.g., gpt-4-0314 vs. gpt-4 vs. GPT-4). Model name normalization maps all variants to canonical identifiers, ensuring consistent aggregation in downstream analyses. The normalization mapping is maintained as a configuration table that is updated as new models are added to the Arena.
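
One simple way to realize such a configuration table is a dictionary lookup on a case-folded key. The entries below are illustrative only; in particular, whether a dated variant such as gpt-4-0314 should collapse into gpt-4 is a policy decision for the maintainers of the table, not something this sketch settles.

```python
# Hypothetical alias table: variant (lowercased) -> canonical identifier.
MODEL_ALIASES = {
    "gpt-4": "gpt-4",
    "gpt-4-0314": "gpt-4",
}


def normalize_model_name(name: str) -> str:
    """Map a raw model name to its canonical identifier."""
    key = name.strip().lower()
    # Unknown names pass through in lowercased form so they remain visible
    # downstream and can be added to the table later.
    return MODEL_ALIASES.get(key, key)
```

Passing unknown names through is one possible choice; a stricter pipeline might raise an error instead so that unmapped models cannot silently enter a release.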

Conversation Structure Validation

Valid conversations must conform to expected structural constraints: each conversation must have at least one user turn and one assistant turn, roles must alternate correctly, and no turn may be empty. Conversations that violate these constraints are either repaired (e.g., by trimming incomplete trailing turns) or excluded. Structure validation ensures that downstream consumers can safely assume well-formed input.
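
The constraints above can be sketched as a single repair-or-reject function. The role names "user" and "assistant" and the content field are assumptions about the log schema, and the assumption that a conversation starts with a user turn is made for the example only.

```python
from typing import List, Optional


def repair_or_reject(turns: List[dict]) -> Optional[List[dict]]:
    """Return a well-formed turn list, or None if the conversation is invalid."""
    turns = list(turns)  # work on a copy; the input is left untouched

    # Repair: trim incomplete trailing turns (an unanswered user turn,
    # or an empty assistant reply).
    while turns and (
        turns[-1]["role"] != "assistant" or not turns[-1]["content"].strip()
    ):
        turns.pop()

    # At least one user turn and one assistant turn must remain.
    if len(turns) < 2:
        return None

    # Roles must alternate user, assistant, user, ... with no empty turn.
    for i, turn in enumerate(turns):
        expected = "user" if i % 2 == 0 else "assistant"
        if turn["role"] != expected or not turn["content"].strip():
            return None
    return turns
```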

Deduplication

Duplicate conversations arise from retry logic, network errors, or user behavior (submitting the same prompt multiple times). Deduplication identifies and removes exact duplicates based on conversation content hashes, retaining only the first occurrence. This prevents duplicate entries from inflating dataset size and biasing statistical analyses.
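
A minimal sketch of hash-based exact deduplication follows. It assumes each record stores its turns under a conversation key and that only role and content define "the same conversation"; which fields feed the hash in the real pipeline is not stated here, so treat that choice as hypothetical.

```python
import hashlib
import json
from typing import List


def content_hash(turns: List[dict]) -> str:
    """Hash the role/content sequence using a stable serialization."""
    payload = json.dumps(
        [[t["role"], t["content"]] for t in turns],
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(convs: List[dict]) -> List[dict]:
    """Drop exact duplicates, keeping the first occurrence in input order."""
    seen = set()
    unique = []
    for conv in convs:
        digest = content_hash(conv["conversation"])
        if digest in seen:
            continue
        seen.add(digest)
        unique.append(conv)
    return unique
```

Because the function preserves input order and keeps the first occurrence, running it on its own output returns the same list, which fits the idempotency requirement.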

Output Format Standardization

The cleaned output is written in a standardized JSON or JSONL format with consistent field names, data types, and encoding. Standardization ensures compatibility with downstream analysis scripts, dataset hosting platforms, and community tools that consume Arena data.
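
For the JSONL case, a stable serialization might look like the sketch below. Sorting keys and disabling ASCII escaping are choices made for this example to keep output byte-stable and readable; the field names in any real release are defined by the pipeline, not by this snippet.

```python
import json
from typing import Iterable


def dumps_jsonl(records: Iterable[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    lines = [
        # sort_keys gives a fixed field order; ensure_ascii=False keeps
        # non-ASCII text as-is instead of \u escapes.
        json.dumps(rec, ensure_ascii=False, sort_keys=True)
        for rec in records
    ]
    return "".join(line + "\n" for line in lines)


def write_jsonl(records: Iterable[dict], path: str) -> None:
    """Write records to a UTF-8 encoded .jsonl file."""
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(dumps_jsonl(records))
```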

Theoretical Basis

Data quality is a prerequisite for valid statistical analysis. The principle of "garbage in, garbage out" dictates that any analysis performed on unclean data will produce unreliable conclusions. Cleaning pipelines must be idempotent -- applying the pipeline twice produces the same result as applying it once -- so that re-cleaning an already cleaned dataset cannot silently alter it. Determinism (identical inputs always produce identical outputs, regardless of execution environment) is a separate, complementary requirement that eliminates hidden dependencies on random seeds, timestamps, or system state. Together, idempotency and determinism allow a published dataset to be independently verified by re-running the cleaning pipeline on the raw logs. PII removal is grounded in privacy-by-design principles, supporting compliance with data protection regulations and ethical research standards.
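
The two properties can be spot-checked with small helpers like the ones below. Here clean stands for any cleaning function that returns a new value without mutating its input, and toy_clean is a made-up stand-in for illustration. Passing these checks on one machine is evidence only: it does not prove determinism across execution environments.

```python
from typing import Any, Callable


def is_idempotent(clean: Callable[[Any], Any], raw: Any) -> bool:
    """Check clean(clean(raw)) == clean(raw) for one input."""
    once = clean(raw)
    return clean(once) == once


def is_deterministic(clean: Callable[[Any], Any], raw: Any, runs: int = 3) -> bool:
    """Check that repeated runs on the same input give equal output."""
    first = clean(raw)
    return all(clean(raw) == first for _ in range(runs - 1))


def toy_clean(rows):
    # Toy cleaner: trim, lowercase, drop empties and duplicates, sort.
    return sorted({r.strip().lower() for r in rows if r.strip()})
```

A function that appends a processing marker on every run would pass the determinism check but fail the idempotency check, which is why the two are tested separately.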

Related Pages

Implemented by: Implementation:Lm_sys_FastChat_Clean_Chat_Data
