Principle:Lm sys FastChat Chat Data Cleaning
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Chat Data Cleaning |
| Repository | lm-sys/FastChat |
| Workflow | Dataset_Release |
| Domains | Data_Processing, Privacy |
| Knowledge Sources | fastchat/serve/monitor/clean_chat_data.py |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This principle defines the requirements and methods for cleaning raw chat conversation logs collected from the Chatbot Arena and related serving infrastructure. Cleaning transforms noisy, heterogeneous log data into well-structured, privacy-compliant datasets suitable for statistical analysis, model evaluation, and public dataset release. The cleaning pipeline must be deterministic (repeated execution on the same input produces identical output) and idempotent (re-running it on already-cleaned output changes nothing).
Description
Content Filtering
Raw chat logs may contain conversations that are inappropriate for release or analysis, including conversations with toxic content, personally identifiable information embedded in prompts, or system-level debug messages. Content filtering applies a series of rule-based and API-based checks to flag or remove such conversations. Conversations that fail any filter are excluded from the cleaned output, with optional logging of rejection reasons for pipeline auditing.
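The rule-based half of this step can be sketched as a chain of small filter functions, each returning a rejection reason or `None`. This is a minimal illustration, not the logic of `clean_chat_data.py`: the conversation shape (`{"conversation": [{"role", "content"}]}`), the `BLOCKED_TERMS` placeholder, the `[DEBUG]` prefix, and all function names are assumptions. API-based checks are omitted.

```python
from typing import Callable, List, Optional, Tuple

# A filter returns a rejection reason, or None when the conversation passes.
Filter = Callable[[dict], Optional[str]]

# Placeholder term list (assumption); a real pipeline would load its own.
BLOCKED_TERMS = {"blockedword"}


def has_blocked_term(conv: dict) -> Optional[str]:
    for turn in conv["conversation"]:
        if any(word in BLOCKED_TERMS for word in turn["content"].lower().split()):
            return "blocked_term"
    return None


def is_debug_message(conv: dict) -> Optional[str]:
    # Assumes debug messages carry a "[DEBUG]" prefix (hypothetical convention).
    for turn in conv["conversation"]:
        if turn["content"].startswith("[DEBUG]"):
            return "debug_message"
    return None


def apply_filters(
    convs: List[dict], filters: List[Filter]
) -> Tuple[List[dict], List[Tuple[dict, str]]]:
    """Split conversations into kept ones and (conversation, reason) rejections."""
    kept, rejected = [], []
    for conv in convs:
        reason = None
        for check in filters:
            reason = check(conv)
            if reason is not None:
                break  # the first failing filter decides the rejection reason
        if reason is None:
            kept.append(conv)
        else:
            rejected.append((conv, reason))
    return kept, rejected
```

Returning the rejected conversations together with their reasons keeps the audit log optional: a caller can write the second list to a file or simply discard it.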
PII Removal
Personally identifiable information (email addresses, phone numbers, physical addresses, and names) must be detected and removed before any dataset release. PII detection combines regex-based pattern matching for structured PII, such as emails and phone numbers, with heuristic approaches for less structured information, such as names and physical addresses. When PII is detected, the affected conversation is either redacted (PII tokens replaced with placeholder markers) or excluded entirely, depending on the severity and density of PII occurrences.
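As a rough sketch of the regex-based part, the snippet below redacts emails and phone numbers and drops the text entirely once the number of hits passes a threshold. The patterns are deliberately simple and will miss or over-match some cases; the placeholder markers, the function names, and the `max_hits` threshold are assumptions chosen for illustration.

```python
import re
from typing import Optional, Tuple

# Simplified patterns (assumptions); production PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> Tuple[str, int]:
    """Replace matches with placeholder markers and return (text, hit count)."""
    text, n_email = EMAIL_RE.subn("[EMAIL]", text)
    text, n_phone = PHONE_RE.subn("[PHONE]", text)
    return text, n_email + n_phone


def redact_or_drop(text: str, max_hits: int = 3) -> Optional[str]:
    """Redact sparse PII; return None when PII is dense enough to exclude the text."""
    redacted, hits = redact_pii(text)
    return None if hits > max_hits else redacted
```

For example, `redact_pii("Mail me at jane.doe@example.com or call +1 415-555-0100.")` yields `("Mail me at [EMAIL] or call [PHONE].", 2)`.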
Model Name Normalization
During Arena operation, model names may appear in various formats due to versioning, aliasing, or configuration changes (e.g., gpt-4-0314 vs. gpt-4 vs. GPT-4). Model name normalization maps all variants to canonical identifiers, ensuring consistent aggregation in downstream analyses. The normalization mapping is maintained as a configuration table that is updated as new models are added to the Arena.
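A normalization table of this kind can be as small as a dictionary lookup applied after case folding. The sketch below is illustrative only: the table contents, the choice of `gpt-4` as the canonical form, and the fallback to the lowercased name for unknown models are all assumptions.

```python
# Hypothetical mapping from known variants to canonical identifiers.
CANONICAL_MODEL_NAMES = {
    "gpt-4-0314": "gpt-4",
    "gpt-4": "gpt-4",
}


def normalize_model_name(name: str) -> str:
    """Map a raw model name to its canonical identifier."""
    key = name.strip().lower()
    # Unknown names fall back to the lowercased form instead of raising.
    return CANONICAL_MODEL_NAMES.get(key, key)
```

Falling back instead of raising keeps the pipeline running when a new model appears before the table is updated. A stricter pipeline might prefer to fail loudly so that the table stays complete.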
Conversation Structure Validation
Valid conversations must conform to expected structural constraints: each conversation must have at least one user turn and one assistant turn, roles must alternate correctly, and no turn may be empty. Conversations that violate these constraints are either repaired (e.g., by trimming incomplete trailing turns) or excluded. Structure validation ensures that downstream consumers can safely assume well-formed input.
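One way to express these constraints is a function that first trims incomplete trailing turns and then checks alternation. The sketch assumes turns are dictionaries with string `role` and `content` fields and that every conversation starts with a user turn. Both assumptions, like the function name, are illustrative.

```python
from typing import List, Optional


def repair_and_validate(turns: List[dict]) -> Optional[List[dict]]:
    """Return a repaired copy of the turn list, or None if it cannot be made valid."""
    turns = list(turns)  # work on a copy so the caller's list is untouched

    # Repair: drop trailing turns that are empty or that leave a user turn unanswered.
    while turns and (not turns[-1]["content"].strip() or turns[-1]["role"] == "user"):
        turns.pop()

    # At least one user turn and one assistant turn must remain.
    if len(turns) < 2:
        return None

    # Roles must alternate user, assistant, user, ... and no turn may be empty.
    for i, turn in enumerate(turns):
        expected = "user" if i % 2 == 0 else "assistant"
        if turn["role"] != expected or not turn["content"].strip():
            return None
    return turns
```

Only trailing problems are repaired here. A violation in the middle of a conversation leads to exclusion, because any repair there would alter what was actually said.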
Deduplication
Duplicate conversations arise from retry logic, network errors, or user behavior (submitting the same prompt multiple times). Deduplication identifies and removes exact duplicates based on conversation content hashes, retaining only the first occurrence. This prevents duplicate entries from inflating dataset size and biasing statistical analyses.
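Exact-duplicate removal by content hash might look like the following. The record shape and the decision to hash only roles and contents, ignoring identifiers and timestamps, are assumptions. SHA-256 over a JSON serialization is one reasonable choice, not necessarily the one used in the repository.

```python
import hashlib
import json
from typing import List


def conversation_hash(turns: List[dict]) -> str:
    """Hash only roles and contents, so metadata differences do not matter."""
    payload = json.dumps(
        [[t["role"], t["content"]] for t in turns], separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(convs: List[dict]) -> List[dict]:
    """Keep the first occurrence of each distinct conversation, preserving order."""
    seen = set()
    unique = []
    for conv in convs:
        digest = conversation_hash(conv["conversation"])
        if digest not in seen:
            seen.add(digest)
            unique.append(conv)
    return unique
```

Because input order is preserved and only the first occurrence is kept, the result is deterministic for a given input order.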
Output Format Standardization
The cleaned output is written in a standardized JSON or JSONL format with consistent field names, data types, and encoding. Standardization ensures compatibility with downstream analysis scripts, dataset hosting platforms, and community tools that consume Arena data.
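A JSONL serializer that enforces field names, types, and key order can be sketched as below. The field names (`conversation_id`, `model`, `conversation`) are hypothetical and stand in for whatever schema the release actually uses.

```python
import json
from typing import Iterable


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialize records as JSONL with a fixed schema and stable key order."""
    lines = []
    for rec in records:
        row = {
            "conversation_id": str(rec["conversation_id"]),
            "model": str(rec["model"]),
            "conversation": [
                {"role": str(t["role"]), "content": str(t["content"])}
                for t in rec["conversation"]
            ],
        }
        # sort_keys gives byte-identical output regardless of dict insertion order;
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        lines.append(json.dumps(row, ensure_ascii=False, sort_keys=True))
    return "\n".join(lines) + ("\n" if lines else "")
```

When writing the result to disk, opening the file with `encoding="utf-8"` keeps the encoding consistent across platforms.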
Theoretical Basis
Data quality is a prerequisite for valid statistical analysis. The principle of garbage in, garbage out dictates that any analysis performed on unclean data will produce unreliable conclusions. Cleaning pipelines must be idempotent (applying the pipeline twice produces the same result as applying it once) to ensure reproducibility of downstream analyses and dataset releases. Determinism (identical inputs always produce identical outputs, regardless of execution environment) is a separate requirement that eliminates hidden dependencies on random seeds, timestamps, or system state. Together, idempotency and determinism guarantee that a published dataset can be independently verified by re-running the cleaning pipeline on the raw logs. PII removal is grounded in privacy-by-design principles, ensuring compliance with data protection regulations and ethical research standards.
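Both properties can be spot-checked with a small harness that runs a cleaning function on raw input and on its own output. The helper and the toy cleaning function below are hypothetical. Running twice in one process only catches some sources of nondeterminism, so a thorough check would also compare runs across machines and environments.

```python
from typing import Callable, List, Tuple


def check_pipeline_properties(
    clean: Callable[[List[str]], List[str]], raw: List[str]
) -> Tuple[bool, bool]:
    """Return (deterministic, idempotent) for one input."""
    once = clean(raw)
    deterministic = clean(raw) == once  # same input, same output
    idempotent = clean(once) == once    # cleaning cleaned data is a no-op
    return deterministic, idempotent


def toy_clean(records: List[str]) -> List[str]:
    """Toy cleaner: strip whitespace, drop empties, keep first occurrences."""
    return list(dict.fromkeys(s.strip() for s in records if s.strip()))
```

Here `check_pipeline_properties(toy_clean, [" a ", "b", "a", ""])` returns `(True, True)`, while a function such as `lambda xs: xs[1:]` passes the determinism check but fails the idempotency check on `[1, 2, 3]`.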