Implementation:Lm sys FastChat Filter Bad Conv Chat1M

Knowledge Sources	Lm_sys_FastChat
Domains	Data_Processing, Model_Evaluation
Last Updated	2026-02-07 06:00 GMT

Overview

Filters conversations for the LMSYS Chat 1M dataset release with additional support for traditional-to-simplified Chinese conversion using the OpenCC library.

Description

Filter Bad Conv Chat1M is a data quality filtering module designed specifically for the LMSYS Chat 1M dataset release pipeline. It shares the same fundamental architecture as the Arena 33K filter, using a TypeCode enum to classify conversations, but includes additional language processing capabilities. Most notably, it integrates the OpenCC library to perform traditional-to-simplified Chinese character conversion (using the "t2s" configuration), ensuring consistency in Chinese text within the released dataset.

The detect_type function applies a sequence of quality checks to each conversation, returning a TypeCode that indicates whether the conversation is suitable for release or the specific reason for exclusion. The checks cover formatting validation, anonymization detection, redaction detection, blocked word filtering, blocked model filtering, minimum length requirements, and frequency-based bot detection. The Chinese conversion step normalizes text before applying these checks, preventing false negatives that could arise from character variant differences.
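The ordered, first-failure-wins structure described above can be sketched as follows. This is a minimal illustration and not the module's actual code: the function name classify_sketch, the marker strings, the word and model lists, and both thresholds are hypothetical placeholders, and the enum simply mirrors the one shown in the Signature section of this page.

```python
from collections import Counter
from enum import Enum


class TypeCode(Enum):
    # Mirrors the enum shown in the Signature section of this page.
    CORRECT = 0
    ANONYMIZED = 1
    REDACTED = 2
    BAD_FORMAT = 3
    BLOCKED_WORD = 4
    BLOCKED_MODEL = 5
    TOO_SHORT = 6
    TOO_FREQUENT = 7


def classify_sketch(
    conv,
    blocked_words=("badword",),         # hypothetical word list
    blocked_models=("blocked-model",),  # hypothetical model list
    anonymized_marker="<anonymized>",   # hypothetical marker string
    redacted_marker="<redacted>",       # hypothetical marker string
    min_chars=1,                        # hypothetical length threshold
    prompt_counts=None,                 # mapping: first user prompt -> count
    max_count=3,                        # hypothetical frequency threshold
):
    """Return the TypeCode of the first failing check, or CORRECT."""
    turns = conv.get("conversation")

    # 1. Formatting: a non-empty list of turns whose content is a string.
    if not isinstance(turns, list) or not turns:
        return TypeCode.BAD_FORMAT
    for turn in turns:
        if not isinstance(turn, dict) or not isinstance(turn.get("content"), str):
            return TypeCode.BAD_FORMAT

    texts = [turn["content"].lower() for turn in turns]

    # 2-3. Anonymization and redaction markers.
    if any(anonymized_marker in text for text in texts):
        return TypeCode.ANONYMIZED
    if any(redacted_marker in text for text in texts):
        return TypeCode.REDACTED

    # 4-5. Blocked words and blocked models.
    if any(word in text for text in texts for word in blocked_words):
        return TypeCode.BLOCKED_WORD
    if conv.get("model") in blocked_models:
        return TypeCode.BLOCKED_MODEL

    # 6. Minimum length, measured here on the user's text only.
    user_prompts = [
        turn["content"].lower().strip()
        for turn in turns
        if turn.get("role") == "user"
    ]
    if len("".join(user_prompts)) < min_chars:
        return TypeCode.TOO_SHORT

    # 7. Frequency-based bot detection on the first user prompt.
    if prompt_counts is not None and user_prompts:
        if prompt_counts.get(user_prompts[0], 0) > max_count:
            return TypeCode.TOO_FREQUENT

    return TypeCode.CORRECT
```

Because each check returns as soon as it fails, a conversation that violates several rules is reported under the earliest one only, which matters when reading the per-category counts.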

The LMSYS Chat 1M corpus is far larger than the Arena 33K dataset and contains conversations in many languages, so multilingual handling matters more here. Normalizing Chinese text keeps filtering behavior consistent across traditional and simplified Chinese inputs.

Usage

Use this module when preparing the LMSYS Chat 1M dataset for public release. It should be applied to all conversations in the corpus, with only those classified as TypeCode.CORRECT retained for the final release. The OpenCC dependency must be installed for Chinese text normalization to function.
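Since the filter depends on OpenCC, it can help to check for the dependency before starting a long run. The sketch below is one possible pre-flight check; the helper names opencc_available and require_opencc are hypothetical and not part of FastChat. It only tests whether a module named opencc is importable, and makes no assumption about which PyPI distribution provides it.

```python
import importlib.util


def opencc_available() -> bool:
    """Return True if a module named `opencc` can be found for import."""
    return importlib.util.find_spec("opencc") is not None


def require_opencc() -> None:
    """Fail early with a clear message instead of failing mid-run."""
    if not opencc_available():
        raise RuntimeError(
            "The `opencc` module is required for Chinese text normalization; "
            "install an OpenCC Python package before running the filter."
        )
```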

Code Reference

Source Location

Repository: Lm_sys_FastChat
File: fastchat/serve/monitor/dataset_release_scripts/lmsys_chat_1m/filter_bad_conv.py
Lines: 1-148

Signature

class TypeCode(Enum):
    CORRECT = 0
    ANONYMIZED = 1
    REDACTED = 2
    BAD_FORMAT = 3
    BLOCKED_WORD = 4
    BLOCKED_MODEL = 5
    TOO_SHORT = 6
    TOO_FREQUENT = 7

def detect_type(conv: dict) -> TypeCode:
    """Classify a conversation for release suitability, applying Chinese text normalization via OpenCC."""

Import

from fastchat.serve.monitor.dataset_release_scripts.lmsys_chat_1m.filter_bad_conv import detect_type

I/O Contract

Inputs

Name	Type	Required	Description
conv	dict	Yes	A conversation dictionary containing messages, model identifiers, language metadata, and conversation ID

Outputs

Name	Type	Description
type_code	TypeCode	An enum value indicating the classification of the conversation: CORRECT for release-worthy entries, or the specific exclusion reason
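
The input shape can be written down as a typed sketch. The field names below are inferred from the usage example on this page rather than from the module's source, so treat them as assumptions; the class names Message and Chat1MConversation are hypothetical.

```python
from typing import List, TypedDict


class Message(TypedDict):
    role: str     # "user" or "assistant" in the example on this page
    content: str  # message text


class Chat1MConversation(TypedDict):
    conversation_id: str
    model: str
    language: str
    conversation: List[Message]


example: Chat1MConversation = {
    "conversation_id": "chat1m_001",
    "model": "chatglm-6b",
    "language": "Chinese",
    "conversation": [{"role": "user", "content": "什么是机器学习？"}],
}
```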

Dependencies

Package	Usage
opencc	Traditional-to-simplified Chinese conversion via opencc.OpenCC("t2s")

Chinese Text Normalization

The module initializes an OpenCC converter with the "t2s" (traditional to simplified) configuration. This converter is applied to Chinese text content before blocked word checks and other filtering rules, ensuring that:

Traditional Chinese variants of blocked words are correctly detected
Text length calculations are consistent across character sets
Frequency-based deduplication treats traditional and simplified versions of the same prompt as equivalent
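
The effect of normalizing before matching can be illustrated with a small, self-contained sketch. To keep it runnable without OpenCC, it uses a toy character mapping (toy_t2s) that covers only the characters in the example; in the module itself this role is played by the opencc.OpenCC("t2s") converter listed under Dependencies. The helper names and the blocked word list are hypothetical.

```python
from collections import Counter

# Toy stand-in for a traditional-to-simplified converter. It covers only the
# characters used below and is NOT a substitute for OpenCC.
_TOY_T2S = str.maketrans({"機": "机", "學": "学", "習": "习", "麼": "么"})


def toy_t2s(text: str) -> str:
    """Convert the handful of traditional characters known to the toy table."""
    return text.translate(_TOY_T2S)


def normalize(text: str, convert=toy_t2s) -> str:
    """Lowercase, strip, and convert to simplified characters."""
    return convert(text.lower().strip())


def contains_blocked_word(text: str, blocked_words, convert=toy_t2s) -> bool:
    """Match blocked words against the normalized text, not the raw text."""
    normalized = normalize(text, convert)
    return any(word in normalized for word in blocked_words)


traditional = "什麼是機器學習？"
simplified = "什么是机器学习？"
blocked = ["机器学习"]  # hypothetical list, stored in simplified form

# A raw substring test misses the traditional variant; the normalized one does not.
missed_without_normalization = not any(word in traditional for word in blocked)
caught_with_normalization = contains_blocked_word(traditional, blocked)

# Both spellings collapse to a single key when counting prompt frequency.
counts = Counter(normalize(prompt) for prompt in [traditional, simplified])
```

Passing the converter in as a callable keeps the matching logic independent of the conversion library, so the same functions can be exercised in tests without OpenCC installed.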

Usage Examples

from fastchat.serve.monitor.dataset_release_scripts.lmsys_chat_1m.filter_bad_conv import (
    detect_type,
    TypeCode,
)

# Classify a conversation with Chinese content
conv = {
    "conversation_id": "chat1m_001",
    "model": "chatglm-6b",
    "language": "Chinese",
    "conversation": [
        {"role": "user", "content": "什么是机器学习？"},
        {"role": "assistant", "content": "机器学习是人工智能的一个分支..."},
    ],
}

result = detect_type(conv)
if result == TypeCode.CORRECT:
    print("Conversation passes all quality checks")
else:
    print(f"Filtered out: {result.name}")

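# Filter the entire dataset for release
import json

stats = {tc: 0 for tc in TypeCode}  # per-category counts
clean = []  # conversations retained for release
# Read as UTF-8 explicitly: the corpus contains non-ASCII (e.g. Chinese) text
with open("lmsys_chat_1m_raw.jsonl", encoding="utf-8") as f: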
    for line in f:
        conv = json.loads(line)
        tc = detect_type(conv)
        stats[tc] += 1
        if tc == TypeCode.CORRECT:
            clean.append(conv)

for tc, count in stats.items():
    print(f"{tc.name}: {count}")
print(f"Total retained: {len(clean)}")
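
For a corpus of this size, holding every retained conversation in a list may be undesirable. A streaming variant is sketched below under stated assumptions: the function name filter_jsonl and the file layout (one JSON object per line) are hypothetical, and the classifier is passed in as a callable so that detect_type or any stand-in returning an enum member can be used.

```python
import json
from collections import Counter


def filter_jsonl(in_path, out_path, classify, keep_label="CORRECT"):
    """Stream conversations from in_path and write retained ones to out_path.

    `classify` is any callable that takes a conversation dict and returns an
    enum member with a `.name` attribute (for example, detect_type).
    Returns a Counter of label name -> number of conversations.
    """
    stats = Counter()
    with open(in_path, encoding="utf-8") as src, open(
        out_path, "w", encoding="utf-8"
    ) as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            conv = json.loads(line)
            label = classify(conv).name
            stats[label] += 1
            if label == keep_label:
                # ensure_ascii=False keeps Chinese text readable in the output
                dst.write(json.dumps(conv, ensure_ascii=False) + "\n")
    return stats
```

Called as filter_jsonl("lmsys_chat_1m_raw.jsonl", "lmsys_chat_1m_clean.jsonl", detect_type), it would produce the same per-category counts as the loop above while writing retained conversations as it goes; both file names are placeholders.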

Related Pages

Implements: Principle:Lm_sys_FastChat_Conversation_Content_Filtering
Lm_sys_FastChat_Filter_Bad_Conv_Arena33k - Similar filtering for the Arena 33K dataset (without Chinese conversion)
Lm_sys_FastChat_Clean_Chat_Data - Upstream data cleaning and deduplication
Lm_sys_FastChat_Deduplication - High-frequency prompt deduplication for TOO_FREQUENT detection
