Implementation:Lm sys FastChat Clean Chat Data
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Model_Evaluation |
| Last Updated | 2026-02-07 06:00 GMT |
Overview
Cleans and deduplicates FastChat conversation logs by filtering for specific action types and removing malformed or duplicate entries.
Description
Clean Chat Data is a data preprocessing module that sanitizes raw FastChat conversation logs for downstream analysis. It reads JSON log files, filters entries by action type (such as votes, conversations, or flagged content), and applies cleaning rules to remove duplicates, malformed records, and irrelevant data. The module supports parallel processing to handle large volumes of log data efficiently.
The cleaning pipeline operates in three stages. First, get_action_type_data reads individual log files and extracts entries matching a specified action type. Then, process_data applies row-level cleaning logic including deduplication checks and format validation. Finally, clean_chat_data orchestrates the full pipeline across multiple log files using configurable parallelism, producing a cleaned DataFrame ready for statistical analysis or dataset release.
This module is essential for preparing raw arena data before computing Elo ratings, generating leaderboard statistics, or releasing public datasets. Without this cleaning step, duplicate votes, malformed conversations, and irrelevant log entries would skew analytical results.
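The three-stage flow described above can be illustrated with a minimal, self-contained sketch. This is not the FastChat implementation: the function names (`read_matching_entries`, `clean_row`, `clean_logs`), the one-JSON-object-per-line file layout, the `type`, `conv_id`, and `model` fields, and the use of a thread pool for parallelism are all assumptions made for illustration.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_matching_entries(path, action_type):
    """Stage 1: return entries in one log file whose "type" equals action_type."""
    entries = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that are not valid JSON
            if isinstance(entry, dict) and entry.get("type") == action_type:
                entries.append(entry)
    return entries


def clean_row(row):
    """Stage 2: return a trimmed copy of the row, or None if it is malformed."""
    # Assumed rule for this sketch: a usable row has both fields below.
    if "conv_id" not in row or "model" not in row:
        return None
    return {"conv_id": row["conv_id"], "model": row["model"]}


def clean_logs(paths, action_type, num_workers=4):
    """Stage 3: run stages 1 and 2 over all files and drop duplicate conv_ids."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        per_file = list(
            pool.map(lambda p: read_matching_entries(p, action_type), paths)
        )
    seen, cleaned = set(), []
    for entries in per_file:
        for row in entries:
            out = clean_row(row)
            if out is None or out["conv_id"] in seen:
                continue
            seen.add(out["conv_id"])
            cleaned.append(out)
    return cleaned


# Tiny demonstration with two temporary log files (made-up sample data).
samples = [
    ['{"type": "vote", "conv_id": "a", "model": "m1"}', "not json"],
    [
        '{"type": "vote", "conv_id": "a", "model": "m1"}',
        '{"type": "chat", "conv_id": "b", "model": "m2"}',
        '{"type": "vote", "model": "m3"}',
    ],
]
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, lines in enumerate(samples):
        path = os.path.join(tmp, f"server{i}.json")
        with open(path, "w") as f:
            f.write("\n".join(lines) + "\n")
        paths.append(path)
    result = clean_logs(paths, "vote")

print(result)  # [{'conv_id': 'a', 'model': 'm1'}]
```

In this sketch the duplicate vote from the second file, the non-matching action type, the invalid JSON line, and the row missing `conv_id` are all dropped, leaving a single cleaned record.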
Usage
Use this module as a preprocessing step before any analytical pipeline that consumes FastChat logs. It should be run after log collection and before statistical analysis, Elo computation, or dataset release preparation. It is especially important when working with logs that span long time periods or multiple server instances where duplicates are more likely.
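When logs come from several server instances, it can help to assemble the input list programmatically. The sketch below is one possible way to do that, assuming the logs sit under a single root directory and end in `.json`; the helper name `collect_log_files` and the directory layout are hypothetical and not part of FastChat.

```python
import glob
import os


def collect_log_files(log_dir, pattern="*.json"):
    """Return a sorted list of unique log file paths under log_dir (recursive)."""
    matches = glob.glob(os.path.join(log_dir, "**", pattern), recursive=True)
    # Resolve symlinks so the same physical file is not listed twice.
    unique = {os.path.realpath(p) for p in matches if os.path.isfile(p)}
    return sorted(unique)
```

The resulting list could then be passed as the log_files argument of clean_chat_data.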
Code Reference
Source Location
- Repository: Lm_sys_FastChat
- File: fastchat/serve/monitor/clean_chat_data.py
- Lines: 1-234
Signature
def clean_chat_data(log_files: list, action_type: str, num_parallel: int = 16) -> pd.DataFrame:
    """Clean and deduplicate chat data from multiple log files with parallel processing."""

def process_data(row: dict, action_type: str) -> dict:
    """Apply cleaning rules to a single log entry row based on the action type."""

def get_action_type_data(filename: str, action_type: str) -> list:
    """Extract entries of a specific action type from a single log file."""
Import
from fastchat.serve.monitor.clean_chat_data import clean_chat_data
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| log_files | list[str] | Yes | List of file paths to raw JSON log files |
| action_type | str | Yes | The type of log action to filter for (e.g., "vote", "conversation", "flag") |
| num_parallel | int | No | Number of parallel workers for file processing (default: 16) |
| row | dict | Yes | A single log entry dictionary (used by process_data) |
| filename | str | Yes | Path to a single log file (used by get_action_type_data) |
Outputs
| Name | Type | Description |
|---|---|---|
| cleaned_df | pd.DataFrame | clean_chat_data returns a deduplicated, cleaned DataFrame of log entries |
| processed_row | dict or None | process_data returns a cleaned version of the input row, or None if the row should be discarded |
| entries | list[dict] | get_action_type_data returns a list of log entries matching the specified action type |
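Because a row cleaner may return None to signal that a row should be discarded, callers need to filter those results out. The following sketch shows that pattern in isolation; `apply_row_cleaner` and `example_cleaner` are hypothetical helpers written for this illustration, and the `conv_id` rule is an assumed stand-in for the real cleaning logic.

```python
def apply_row_cleaner(rows, cleaner):
    """Apply cleaner to each row and keep only the results that are not None."""
    kept = []
    for row in rows:
        result = cleaner(row)
        if result is not None:
            kept.append(result)
    return kept


def example_cleaner(row):
    """Hypothetical stand-in cleaner: discard rows that lack a 'conv_id'."""
    if "conv_id" not in row:
        return None
    return {"conv_id": row["conv_id"]}


kept = apply_row_cleaner(
    [{"conv_id": "a", "extra": 1}, {"model": "m"}],
    example_cleaner,
)
print(kept)  # [{'conv_id': 'a'}]
```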
Usage Examples
from fastchat.serve.monitor.clean_chat_data import clean_chat_data

# Clean vote data from multiple log files
log_files = ["logs/server1.json", "logs/server2.json", "logs/server3.json"]
cleaned_df = clean_chat_data(log_files, action_type="vote", num_parallel=8)
print(f"Cleaned records: {len(cleaned_df)}")

# The column names below ("conv_id", "model") are illustrative; check
# cleaned_df.columns for the fields present in your cleaned output.
print(f"Unique conversations: {cleaned_df['conv_id'].nunique()}")

# Count cleaned records per model as a first step toward downstream Elo analysis
vote_counts = cleaned_df["model"].value_counts()
print(vote_counts.head(10))
Related Pages
- Implements: Principle:Lm_sys_FastChat_Chat_Data_Cleaning
- Lm_sys_FastChat_Basic_Stats - Basic usage statistics from cleaned log data
- Lm_sys_FastChat_Filter_Bad_Conv_Arena33k - Conversation filtering for arena_33k dataset release
- Lm_sys_FastChat_Filter_Bad_Conv_Chat1M - Conversation filtering for lmsys_chat_1m dataset release
- Lm_sys_FastChat_Deduplication - High-frequency prompt deduplication