Implementation:Lm sys FastChat Clean Chat Data

From Leeroopedia


Knowledge Sources
Domains Data_Processing, Model_Evaluation
Last Updated 2026-02-07 06:00 GMT

Overview

Cleans and deduplicates FastChat conversation logs by filtering for specific action types and removing malformed or duplicate entries.

Description

Clean Chat Data is a data preprocessing module that sanitizes raw FastChat conversation logs for downstream analysis. It reads JSON log files, filters entries by action type (such as votes, conversations, or flagged content), and applies cleaning rules to remove duplicates, malformed records, and irrelevant data. The module supports parallel processing to handle large volumes of log data efficiently.

The cleaning pipeline operates in three stages. First, get_action_type_data reads individual log files and extracts entries matching a specified action type. Then, process_data applies row-level cleaning logic including deduplication checks and format validation. Finally, clean_chat_data orchestrates the full pipeline across multiple log files using configurable parallelism, producing a cleaned DataFrame ready for statistical analysis or dataset release.
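The three stages can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper names (read_matching_entries, clean_row, run_pipeline), the one-JSON-object-per-line file layout, the "type", "conv_id", and "model" field names, and deduplication by conv_id are all assumptions made for the example. It also swaps in a thread pool for whatever parallelism the module uses, and returns a plain list rather than a DataFrame to stay dependency-free.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def read_matching_entries(filename, action_type):
    """Stage 1 (hypothetical): keep entries whose "type" equals action_type."""
    entries = []
    with open(filename, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that are not valid JSON
            if isinstance(entry, dict) and entry.get("type") == action_type:
                entries.append(entry)
    return entries


def clean_row(row):
    """Stage 2 (hypothetical): return a normalized row, or None to discard it."""
    conv_id = row.get("conv_id")
    model = row.get("model")
    if not conv_id or not model:
        return None  # malformed: a required field is missing or empty
    return {"conv_id": conv_id, "model": model, "type": row["type"]}


def run_pipeline(log_files, action_type, num_parallel=4):
    """Stage 3 (hypothetical): read files in parallel, then clean and dedupe."""
    reader = partial(read_matching_entries, action_type=action_type)
    with ThreadPoolExecutor(max_workers=num_parallel) as pool:
        per_file = list(pool.map(reader, log_files))  # preserves file order

    seen_conv_ids = set()
    cleaned = []
    for entries in per_file:
        for row in entries:
            result = clean_row(row)
            if result is None or result["conv_id"] in seen_conv_ids:
                continue  # drop malformed rows and repeated conversations
            seen_conv_ids.add(result["conv_id"])
            cleaned.append(result)
    return cleaned
```

Deduplication happens after the parallel read so that the set of seen IDs lives in one place; the per-file work stays independent and needs no shared state.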

This module is essential for preparing raw arena data before computing Elo ratings, generating leaderboard statistics, or releasing public datasets. Without this cleaning step, duplicate votes, malformed conversations, and irrelevant log entries would skew analytical results.
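To see how duplicates can skew a statistic, consider the toy calculation below. The records, the "winner" field, and both helper functions are made up for illustration and do not reflect the real log schema.

```python
from collections import Counter

# Hypothetical vote records; conversation "c2" was logged three times.
votes = [
    {"conv_id": "c1", "winner": "model_a"},
    {"conv_id": "c2", "winner": "model_b"},
    {"conv_id": "c2", "winner": "model_b"},  # duplicate
    {"conv_id": "c2", "winner": "model_b"},  # duplicate
    {"conv_id": "c3", "winner": "model_a"},
]


def win_share(rows, model):
    """Fraction of rows in which `model` is recorded as the winner."""
    counts = Counter(r["winner"] for r in rows)
    return counts[model] / len(rows)


def dedupe_by_conv_id(rows):
    """Keep only the first row seen for each conv_id."""
    seen = set()
    unique = []
    for r in rows:
        if r["conv_id"] not in seen:
            seen.add(r["conv_id"])
            unique.append(r)
    return unique


raw_share = win_share(votes, "model_b")                       # 3 of 5 rows
clean_share = win_share(dedupe_by_conv_id(votes), "model_b")  # 1 of 3 rows
print(f"raw: {raw_share:.2f}, deduplicated: {clean_share:.2f}")
```

With the duplicates left in, model_b appears to win 60% of votes; after deduplication it wins one conversation in three.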

Usage

Run this module as a preprocessing step after log collection and before any pipeline that consumes FastChat logs, such as statistical analysis, Elo computation, or dataset release preparation. It matters most for logs that span long time periods or multiple server instances, where duplicates are more likely.
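When logs come from several server instances, the list of input files has to be assembled first. One possible way to do that is sketched below; the collect_log_files helper, the directory layout, and the "*.json" pattern are assumptions for illustration, not part of the module.

```python
from pathlib import Path


def collect_log_files(root, pattern="*.json"):
    """Return a sorted list of unique log file paths found under `root`.

    Hypothetical helper: searches recursively, so per-server
    subdirectories such as root/server1/ and root/server2/ are included.
    """
    root = Path(root)
    unique_paths = {str(p.resolve()) for p in root.rglob(pattern) if p.is_file()}
    return sorted(unique_paths)
```

The resulting list could then be passed as the log_files argument. Resolving paths before collecting them into a set avoids listing the same file twice when it is reachable through a symlink.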

Code Reference

Source Location

Signature

def clean_chat_data(log_files: list, action_type: str, num_parallel: int = 16) -> pd.DataFrame:
    """Clean and deduplicate chat data from multiple log files with parallel processing."""

def process_data(row: dict, action_type: str) -> Optional[dict]:
    """Apply cleaning rules to a single log entry row based on the action type."""

def get_action_type_data(filename: str, action_type: str) -> list:
    """Extract entries of a specific action type from a single log file."""

Import

from fastchat.serve.monitor.clean_chat_data import clean_chat_data

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| log_files | list[str] | Yes | List of file paths to raw JSON log files |
| action_type | str | Yes | The type of log action to filter for (e.g., "vote", "conversation", "flag") |
| num_parallel | int | No | Number of parallel workers for file processing (default: 16) |
| row | dict | Yes | A single log entry dictionary (used by process_data) |
| filename | str | Yes | Path to a single log file (used by get_action_type_data) |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| cleaned_df | pd.DataFrame | clean_chat_data returns a deduplicated, cleaned DataFrame of log entries |
| processed_row | dict | process_data returns a cleaned version of the input row, or None if the row should be discarded |
| entries | list[dict] | get_action_type_data returns a list of log entries matching the specified action type |

Usage Examples

from fastchat.serve.monitor.clean_chat_data import clean_chat_data

# Clean vote data from multiple log files
log_files = ["logs/server1.json", "logs/server2.json", "logs/server3.json"]
cleaned_df = clean_chat_data(log_files, action_type="vote", num_parallel=8)

print(f"Cleaned records: {len(cleaned_df)}")
print(f"Unique conversations: {cleaned_df['conv_id'].nunique()}")

# Count cleaned records per model, e.g. as a sanity check before Elo analysis
vote_counts = cleaned_df["model"].value_counts()
print(vote_counts.head(10))

Related Pages
