Implementation:Lm sys FastChat Filter Bad Conv Chat1M

From Leeroopedia


Knowledge Sources
Domains Data_Processing, Model_Evaluation
Last Updated 2026-02-07 06:00 GMT

Overview

Filters conversations for the LMSYS Chat 1M dataset release with additional support for traditional-to-simplified Chinese conversion using the OpenCC library.

Description

Filter Bad Conv Chat1M is a data quality filtering module designed specifically for the LMSYS Chat 1M dataset release pipeline. It shares the same fundamental architecture as the Arena 33K filter, using a TypeCode enum to classify conversations, but includes additional language processing capabilities. Most notably, it integrates the OpenCC library to perform traditional-to-simplified Chinese character conversion (using the "t2s" configuration), ensuring consistency in Chinese text within the released dataset.

The detect_type function applies a sequence of quality checks to each conversation, returning a TypeCode that indicates whether the conversation is suitable for release or the specific reason for exclusion. The checks cover formatting validation, anonymization detection, redaction detection, blocked word filtering, blocked model filtering, minimum length requirements, and frequency-based bot detection. The Chinese conversion step normalizes text before applying these checks, preventing false negatives that could arise from character variant differences.
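The ordering of checks described above can be illustrated with a minimal sketch. This is not the FastChat implementation: the function name detect_type_sketch, the marker strings, the blocked lists, and both thresholds are hypothetical placeholders, and Chinese normalization is left out to keep the example short. It only shows the "first failing check wins" structure.

```python
from enum import Enum


class TypeCode(Enum):
    # Mirrors the enum shown in the Signature section of this page.
    CORRECT = 0
    ANONYMIZED = 1
    REDACTED = 2
    BAD_FORMAT = 3
    BLOCKED_WORD = 4
    BLOCKED_MODEL = 5
    TOO_SHORT = 6
    TOO_FREQUENT = 7


# Hypothetical configuration, for illustration only.
ANONYMIZED_MARKER = "<anonymized>"
REDACTED_MARKER = "<redacted>"
BLOCKED_WORDS = {"example-blocked-word"}
BLOCKED_MODELS = {"example-blocked-model"}
MIN_PROMPT_CHARS = 2
MAX_PROMPT_REPEATS = 100


def detect_type_sketch(conv, prompt_counts=None):
    """Return the first failing check, or CORRECT if every check passes."""
    messages = conv.get("conversation")
    # Formatting validation: a non-empty list of dicts with string content.
    if not isinstance(messages, list) or not messages:
        return TypeCode.BAD_FORMAT
    if not all(isinstance(m, dict) and isinstance(m.get("content"), str)
               for m in messages):
        return TypeCode.BAD_FORMAT

    texts = [m["content"].lower() for m in messages]
    if any(ANONYMIZED_MARKER in t for t in texts):
        return TypeCode.ANONYMIZED
    if any(REDACTED_MARKER in t for t in texts):
        return TypeCode.REDACTED
    if any(word in t for t in texts for word in BLOCKED_WORDS):
        return TypeCode.BLOCKED_WORD
    if conv.get("model") in BLOCKED_MODELS:
        return TypeCode.BLOCKED_MODEL

    first_prompt = texts[0].strip()
    if len(first_prompt) < MIN_PROMPT_CHARS:
        return TypeCode.TOO_SHORT
    # Frequency check: needs corpus-wide counts computed in an earlier pass.
    if prompt_counts and prompt_counts.get(first_prompt, 0) > MAX_PROMPT_REPEATS:
        return TypeCode.TOO_FREQUENT
    return TypeCode.CORRECT
```

Because each check returns immediately, a conversation is assigned exactly one exclusion reason even when it would fail several checks.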

The LMSYS Chat 1M corpus is significantly larger than the Arena 33K dataset and contains conversations in many languages, which makes multilingual handling essential. Chinese normalization ensures consistent filtering behavior across traditional and simplified Chinese inputs.

Usage

Use this module when preparing the LMSYS Chat 1M dataset for public release. It should be applied to all conversations in the corpus, with only those classified as TypeCode.CORRECT retained for the final release. The OpenCC dependency must be installed for Chinese text normalization to function.

Code Reference

Source Location

Signature

from enum import Enum

class TypeCode(Enum):
    CORRECT = 0
    ANONYMIZED = 1
    REDACTED = 2
    BAD_FORMAT = 3
    BLOCKED_WORD = 4
    BLOCKED_MODEL = 5
    TOO_SHORT = 6
    TOO_FREQUENT = 7

def detect_type(conv: dict) -> TypeCode:
    """Classify a conversation for release suitability, applying Chinese text normalization via OpenCC."""

Import

from fastchat.serve.monitor.dataset_release_scripts.lmsys_chat_1m.filter_bad_conv import detect_type

I/O Contract

Inputs

Name | Type | Required | Description
conv | dict | Yes | A conversation dictionary containing messages, model identifiers, language metadata, and conversation ID

Outputs

Name | Type | Description
type_code | TypeCode | An enum value indicating the classification of the conversation: CORRECT for release-worthy entries, or the specific exclusion reason

Dependencies

Package | Usage
opencc | Traditional-to-simplified Chinese conversion via opencc.OpenCC("t2s")

Chinese Text Normalization

The module initializes an OpenCC converter with the "t2s" (traditional to simplified) configuration. This converter is applied to Chinese text content before blocked word checks and other filtering rules, ensuring that:

  • Traditional Chinese variants of blocked words are correctly detected
  • Text length calculations are consistent across character sets
  • Frequency-based checks treat traditional and simplified versions of the same prompt as equivalent
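The effect of normalization on blocked word matching and frequency counting can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the helper names to_simplified and contains_blocked, the harmless stand-in blocked word, and the tiny fallback mapping (used only when OpenCC is unavailable) are all hypothetical.

```python
from collections import Counter


def _build_converter():
    """Return a traditional-to-simplified converter function."""
    try:
        import opencc  # third-party dependency named on this page
        return opencc.OpenCC("t2s").convert
    except Exception:
        # Assumption: fallback for environments without OpenCC. It maps only
        # the few characters used in this example and is NOT a real converter.
        table = str.maketrans({"機": "机", "學": "学", "習": "习", "麼": "么"})
        return lambda text: text.translate(table)


to_simplified = _build_converter()

# Hypothetical blocked list, stored in simplified form.
BLOCKED_WORDS = {"机器学习"}


def contains_blocked(text):
    """Check blocked words against the normalized (simplified) text."""
    normalized = to_simplified(text.lower())
    return any(word in normalized for word in BLOCKED_WORDS)


# Traditional and simplified spellings of the same prompt share one count.
prompts = ["機器學習", "机器学习"]
prompt_counts = Counter(to_simplified(p) for p in prompts)
```

Keeping the blocked list in simplified form and normalizing only the input means each blocked word needs to be listed once, rather than once per character variant.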

Usage Examples

from fastchat.serve.monitor.dataset_release_scripts.lmsys_chat_1m.filter_bad_conv import (
    detect_type,
    TypeCode,
)

# Classify a conversation with Chinese content
conv = {
    "conversation_id": "chat1m_001",
    "model": "chatglm-6b",
    "language": "Chinese",
    "conversation": [
        {"role": "user", "content": "什么是机器学习?"},
        {"role": "assistant", "content": "机器学习是人工智能的一个分支..."},
    ],
}

result = detect_type(conv)
if result == TypeCode.CORRECT:
    print("Conversation passes all quality checks")
else:
    print(f"Filtered out: {result.name}")

# Filter entire dataset for release
import json

stats = {tc: 0 for tc in TypeCode}
clean = []
# Read as UTF-8 explicitly, since the corpus contains non-ASCII text
with open("lmsys_chat_1m_raw.jsonl", encoding="utf-8") as f:
    for line in f:
        conv = json.loads(line)
        tc = detect_type(conv)
        stats[tc] += 1
        if tc == TypeCode.CORRECT:
            clean.append(conv)

for tc, count in stats.items():
    print(f"{tc.name}: {count}")
print(f"Total retained: {len(clean)}")
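The loop above keeps every retained conversation in memory. For a corpus of this size, a streaming variant that writes retained records as it goes may be preferable. The sketch below is one possible shape, not part of FastChat: filter_jsonl and its parameters are hypothetical, and the classifier is passed in as a callable so that detect_type and TypeCode.CORRECT could be supplied by the caller.

```python
import json
from collections import Counter


def filter_jsonl(src_path, dst_path, classify, keep_label):
    """Stream a JSONL file, writing only records whose label equals keep_label.

    Returns a Counter of labels so the caller can report exclusion statistics.
    """
    stats = Counter()
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # tolerate blank lines
            conv = json.loads(line)
            label = classify(conv)
            stats[label] += 1
            if label == keep_label:
                # ensure_ascii=False keeps Chinese text readable in the output
                dst.write(json.dumps(conv, ensure_ascii=False) + "\n")
    return stats
```

Under these assumptions, a call would look like filter_jsonl("lmsys_chat_1m_raw.jsonl", "clean.jsonl", detect_type, TypeCode.CORRECT), where the output filename is a placeholder.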
