Implementation:Lm sys FastChat Clean Chat Data

Knowledge Sources	Lm_sys_FastChat
Domains	Data_Processing, Model_Evaluation
Last Updated	2026-02-07 06:00 GMT

Overview

Cleans and deduplicates FastChat conversation logs by filtering for specific action types and removing malformed or duplicate entries.

Description

Clean Chat Data is a data preprocessing module that sanitizes raw FastChat conversation logs for downstream analysis. It reads JSON log files, filters entries by action type (such as votes, conversations, or flagged content), and applies cleaning rules to remove duplicates, malformed records, and irrelevant data. The module supports parallel processing to handle large volumes of log data efficiently.

The cleaning pipeline operates in three stages. First, get_action_type_data reads individual log files and extracts entries matching a specified action type. Then, process_data applies row-level cleaning logic including deduplication checks and format validation. Finally, clean_chat_data orchestrates the full pipeline across multiple log files using configurable parallelism, producing a cleaned DataFrame ready for statistical analysis or dataset release.
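
The three stages above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the module's actual code: the one-JSON-object-per-line log layout, the `type`, `conv_id`, and `model` field names, the choice of `conv_id` as the deduplication key, and the `_sketch` function names are all hypothetical. To stay dependency-free it uses a thread pool and returns a plain list of dicts rather than a DataFrame.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def get_action_type_data_sketch(filename, action_type):
    """Stage 1: read one log file and keep entries of the requested type."""
    entries = []
    with open(filename, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines instead of failing the file
            # "type" is an assumed field name for the action type.
            if isinstance(row, dict) and row.get("type") == action_type:
                entries.append(row)
    return entries


def process_data_sketch(row, action_type):
    """Stage 2: validate one row; return None to signal it should be dropped."""
    conv_id = row.get("conv_id")  # assumed identifier field
    if not isinstance(conv_id, str) or not conv_id:
        return None
    return {"conv_id": conv_id, "type": action_type, "model": row.get("model")}


def clean_chat_data_sketch(log_files, action_type, num_parallel=16):
    """Stage 3: fan out over files, clean each row, drop duplicates."""
    reader = partial(get_action_type_data_sketch, action_type=action_type)
    with ThreadPoolExecutor(max_workers=num_parallel) as pool:
        per_file = list(pool.map(reader, log_files))  # order follows log_files

    seen, cleaned = set(), []
    for entries in per_file:
        for row in entries:
            result = process_data_sketch(row, action_type)
            if result is None or result["conv_id"] in seen:
                continue
            seen.add(result["conv_id"])
            cleaned.append(result)
    return cleaned
```

Keeping the per-row logic in its own function means the validation rules can be tested on single dictionaries without touching the file system or the worker pool.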

This module is essential for preparing raw arena data before computing Elo ratings, generating leaderboard statistics, or releasing public datasets. Without this cleaning step, duplicate votes, malformed conversations, and irrelevant log entries would skew analytical results.

Usage

Run this module after log collection and before any analysis that consumes FastChat logs, such as statistical analysis, Elo computation, or dataset release preparation. It matters most for logs that span long time periods or multiple server instances, where duplicates are more likely.
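
When logs come from several server instances, the list of input files has to be assembled before cleaning. One way to do that is sketched below; the helper name `collect_log_files`, the directory-per-server layout, and the `*.json` pattern are assumptions for illustration and are not part of the module.

```python
import glob
import os


def collect_log_files(log_root, pattern="*.json"):
    """Return a sorted list of unique log file paths found under log_root."""
    # "**" with recursive=True also matches files directly inside log_root.
    matches = glob.glob(os.path.join(log_root, "**", pattern), recursive=True)
    # Resolve symlinks so a file reachable by two paths is listed only once.
    unique = {os.path.realpath(p) for p in matches if os.path.isfile(p)}
    return sorted(unique)
```

The resulting list can be passed as the log_files argument. Listing each file once avoids feeding the same entries into the pipeline twice.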

Code Reference

Source Location

Repository: Lm_sys_FastChat
File: fastchat/serve/monitor/clean_chat_data.py
Lines: 1-234

Signature

def clean_chat_data(log_files: list, action_type: str, num_parallel: int = 16) -> pd.DataFrame:
    """Clean and deduplicate chat data from multiple log files with parallel processing."""

def process_data(row: dict, action_type: str) -> Optional[dict]:
    """Apply cleaning rules to a single log entry row based on the action type."""

def get_action_type_data(filename: str, action_type: str) -> list:
    """Extract entries of a specific action type from a single log file."""

Import

from fastchat.serve.monitor.clean_chat_data import clean_chat_data

I/O Contract

Inputs

Name	Type	Required	Description
log_files	list[str]	Yes	List of file paths to raw JSON log files
action_type	str	Yes	The type of log action to filter for (e.g., "vote", "conversation", "flag")
num_parallel	int	No	Number of parallel workers for file processing (default: 16)
row	dict	Yes	A single log entry dictionary (used by process_data)
filename	str	Yes	Path to a single log file (used by get_action_type_data)

Outputs

Name	Type	Description
cleaned_df	pd.DataFrame	clean_chat_data returns a deduplicated, cleaned DataFrame of log entries
processed_row	dict	process_data returns a cleaned version of the input row, or None if the row should be discarded
entries	list[dict]	get_action_type_data returns a list of log entries matching the specified action type
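
The "cleaned row or None" contract described for process_data can be consumed as in the sketch below. Both `apply_row_cleaner` and `drop_malformed` are hypothetical helpers written for this example, and the `conversation` field name is an assumption about the row layout.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def apply_row_cleaner(
    rows: Iterable[dict],
    cleaner: Callable[[dict], Optional[dict]],
) -> Tuple[List[dict], int]:
    """Apply a cleaner to each row; return kept rows and the discard count."""
    kept: List[dict] = []
    discarded = 0
    for row in rows:
        result = cleaner(row)
        if result is None:
            discarded += 1  # the cleaner signalled that this row is unusable
        else:
            kept.append(result)
    return kept, discarded


def drop_malformed(row: dict) -> Optional[dict]:
    """Hypothetical cleaner: keep rows with a non-empty conversation list."""
    conv = row.get("conversation")
    if not isinstance(conv, list) or not conv:
        return None
    return row
```

Tracking the discard count alongside the kept rows makes it easy to notice when a cleaning rule removes far more data than expected.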

Usage Examples

from fastchat.serve.monitor.clean_chat_data import clean_chat_data

# Clean vote data from multiple log files
log_files = ["logs/server1.json", "logs/server2.json", "logs/server3.json"]
cleaned_df = clean_chat_data(log_files, action_type="vote", num_parallel=8)

print(f"Cleaned records: {len(cleaned_df)}")
print(f"Unique conversations: {cleaned_df['conv_id'].nunique()}")

# Inspect per-model record counts before running downstream Elo analysis
vote_counts = cleaned_df["model"].value_counts()
print(vote_counts.head(10))

Related Pages

Implements: Principle:Lm_sys_FastChat_Chat_Data_Cleaning
Lm_sys_FastChat_Basic_Stats - Basic usage statistics from cleaned log data
Lm_sys_FastChat_Filter_Bad_Conv_Arena33k - Conversation filtering for arena_33k dataset release
Lm_sys_FastChat_Filter_Bad_Conv_Chat1M - Conversation filtering for lmsys_chat_1m dataset release
Lm_sys_FastChat_Deduplication - High-frequency prompt deduplication
