Implementation:Lm sys FastChat Filter Bad Conv Chat1M
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Model_Evaluation |
| Last Updated | 2026-02-07 06:00 GMT |
Overview
Filters conversations for the LMSYS Chat 1M dataset release with additional support for traditional-to-simplified Chinese conversion using the OpenCC library.
Description
Filter Bad Conv Chat1M is a data quality filtering module designed specifically for the LMSYS Chat 1M dataset release pipeline. It shares the same fundamental architecture as the Arena 33K filter, using a TypeCode enum to classify conversations, but includes additional language processing capabilities. Most notably, it integrates the OpenCC library to perform traditional-to-simplified Chinese character conversion (using the "t2s" configuration), ensuring consistency in Chinese text within the released dataset.
The detect_type function applies a sequence of quality checks to each conversation, returning a TypeCode that indicates whether the conversation is suitable for release or the specific reason for exclusion. The checks cover formatting validation, anonymization detection, redaction detection, blocked word filtering, blocked model filtering, minimum length requirements, and frequency-based bot detection. The Chinese conversion step normalizes text before applying these checks, preventing false negatives that could arise from character variant differences.
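The check sequence described above can be sketched as a chain of early returns. This is a minimal illustration, not the module's actual code: the marker strings `<anonymized>` and `<redacted>`, the blocked lists, the thresholds, the `prompt_counts` argument, and the name `detect_type_sketch` are all assumptions made for the example, and `normalize` only lowercases where the real module also runs the OpenCC conversion.

```python
from enum import Enum


class TypeCode(Enum):
    CORRECT = 0
    ANONYMIZED = 1
    REDACTED = 2
    BAD_FORMAT = 3
    BLOCKED_WORD = 4
    BLOCKED_MODEL = 5
    TOO_SHORT = 6
    TOO_FREQUENT = 7


# Hypothetical configuration; the real lists and thresholds are not given here.
BLOCKED_WORDS = {"example_blocked_word"}
BLOCKED_MODELS = {"example-blocked-model"}
MIN_MESSAGES = 2
MAX_PROMPT_REPEATS = 3


def normalize(text):
    # Stand-in for the normalization step: lowercasing only.
    return text.lower()


def detect_type_sketch(conv, prompt_counts):
    """Return the first failing check, or TypeCode.CORRECT if none fails."""
    messages = conv.get("conversation")
    # Formatting validation: a list of dicts, each with string content.
    if not isinstance(messages, list) or not all(
        isinstance(m, dict) and isinstance(m.get("content"), str) for m in messages
    ):
        return TypeCode.BAD_FORMAT

    texts = [normalize(m["content"]) for m in messages]
    for text in texts:
        if "<anonymized>" in text:  # assumed marker string
            return TypeCode.ANONYMIZED
        if "<redacted>" in text:  # assumed marker string
            return TypeCode.REDACTED
        if any(word in text for word in BLOCKED_WORDS):
            return TypeCode.BLOCKED_WORD

    if conv.get("model") in BLOCKED_MODELS:
        return TypeCode.BLOCKED_MODEL
    if len(texts) < MIN_MESSAGES:
        return TypeCode.TOO_SHORT
    # Frequency check: the first (normalized) prompt has been seen too often.
    if prompt_counts.get(texts[0], 0) > MAX_PROMPT_REPEATS:
        return TypeCode.TOO_FREQUENT
    return TypeCode.CORRECT
```

Returning on the first failing check means each conversation receives exactly one exclusion reason, so per-reason counts sum to the corpus size.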
The LMSYS Chat 1M corpus is significantly larger than the Arena 33K dataset and contains conversations in many languages, which makes multilingual handling essential. The Chinese normalization ensures consistent filtering behavior across traditional and simplified Chinese inputs.
Usage
Use this module when preparing the LMSYS Chat 1M dataset for public release. It should be applied to all conversations in the corpus, with only those classified as TypeCode.CORRECT retained for the final release. The OpenCC dependency must be installed for Chinese text normalization to function.
Code Reference
Source Location
- Repository: Lm_sys_FastChat
- File: fastchat/serve/monitor/dataset_release_scripts/lmsys_chat_1m/filter_bad_conv.py
- Lines: 1-148
Signature
from enum import Enum


class TypeCode(Enum):
    CORRECT = 0
    ANONYMIZED = 1
    REDACTED = 2
    BAD_FORMAT = 3
    BLOCKED_WORD = 4
    BLOCKED_MODEL = 5
    TOO_SHORT = 6
    TOO_FREQUENT = 7


def detect_type(conv: dict) -> TypeCode:
    """Classify a conversation for release suitability, applying Chinese text normalization via OpenCC."""
Import
from fastchat.serve.monitor.dataset_release_scripts.lmsys_chat_1m.filter_bad_conv import detect_type
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| conv | dict | Yes | A conversation dictionary containing messages, model identifiers, language metadata, and conversation ID |
Outputs
| Name | Type | Description |
|---|---|---|
| type_code | TypeCode | An enum value indicating the classification of the conversation: CORRECT for release-worthy entries, or the specific exclusion reason |
Dependencies
| Package | Usage |
|---|---|
| opencc | Traditional-to-simplified Chinese conversion via opencc.OpenCC("t2s") |
Chinese Text Normalization
The module initializes an OpenCC converter with the "t2s" (traditional to simplified) configuration. This converter is applied to Chinese text content before blocked word checks and other filtering rules, ensuring that:
- Traditional Chinese variants of blocked words are correctly detected
- Text length calculations are consistent across character sets
- Frequency-based deduplication treats traditional and simplified versions of the same prompt as equivalent
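The effect of normalizing before matching can be illustrated with a small sketch. To stay runnable without the dependency, it uses a hypothetical stand-in converter, `demo_t2s`, that maps only the few characters in the example; with OpenCC installed, `opencc.OpenCC("t2s").convert` would be passed in its place. The helper names and the sample blocked word are assumptions for illustration, not part of the module.

```python
from collections import Counter

# Illustrative stand-in for opencc.OpenCC("t2s").convert. It covers only the
# characters used in this example and is NOT a general converter.
_DEMO_T2S = str.maketrans({"機": "机", "學": "学", "習": "习"})


def demo_t2s(text):
    return text.translate(_DEMO_T2S)


def contains_blocked_word(text, blocked_words, convert=demo_t2s):
    """Check blocked words against the normalized (simplified, lowercased) text."""
    normalized = convert(text.lower())
    return any(word in normalized for word in blocked_words)


def count_prompts(prompts, convert=demo_t2s):
    """Count prompts by normalized form so script variants share one key."""
    return Counter(convert(p.lower()) for p in prompts)


# A blocked list written in simplified Chinese also matches traditional input.
blocked = ["机器学习"]  # hypothetical entry
print(contains_blocked_word("什麼是機器學習", blocked))
# Without conversion the same input slips through.
print(contains_blocked_word("什麼是機器學習", blocked, convert=lambda t: t))
# Traditional and simplified spellings of one prompt are counted together.
print(count_prompts(["機器學習", "机器学习"]))
```

Keeping the converter as a parameter makes the matching logic testable without the OpenCC dependency, while the blocked list only needs to be maintained in one script.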
Usage Examples
from fastchat.serve.monitor.dataset_release_scripts.lmsys_chat_1m.filter_bad_conv import (
    detect_type,
    TypeCode,
)

# Classify a conversation with Chinese content
conv = {
    "conversation_id": "chat1m_001",
    "model": "chatglm-6b",
    "language": "Chinese",
    "conversation": [
        {"role": "user", "content": "什么是机器学习?"},
        {"role": "assistant", "content": "机器学习是人工智能的一个分支..."},
    ],
}
result = detect_type(conv)
if result == TypeCode.CORRECT:
    print("Conversation passes all quality checks")
else:
    print(f"Filtered out: {result.name}")

# Filter entire dataset for release
import json

stats = {tc: 0 for tc in TypeCode}
clean = []
with open("lmsys_chat_1m_raw.jsonl") as f:
    for line in f:
        conv = json.loads(line)
        tc = detect_type(conv)
        stats[tc] += 1
        if tc == TypeCode.CORRECT:
            clean.append(conv)
for tc, count in stats.items():
    print(f"{tc.name}: {count}")
print(f"Total retained: {len(clean)}")
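For a corpus of this size, holding every retained conversation in a list may be impractical. A streaming variant, sketched below under the assumption that the input is one JSON object per line, writes each retained record as soon as it is classified. The function name `filter_jsonl`, its parameters, and the output file name in the usage note are hypothetical.

```python
import json
from collections import Counter


def filter_jsonl(in_path, out_path, classify, keep):
    """Stream a JSONL file, writing only records whose classification equals `keep`.

    Returns a Counter of classification results for reporting.
    """
    stats = Counter()
    with open(in_path, encoding="utf-8") as src, open(
        out_path, "w", encoding="utf-8"
    ) as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            conv = json.loads(line)
            code = classify(conv)
            stats[code] += 1
            if code == keep:
                # ensure_ascii=False keeps Chinese text readable in the output
                dst.write(json.dumps(conv, ensure_ascii=False) + "\n")
    return stats
```

With the module imported as in the example above, a call might look like `filter_jsonl("lmsys_chat_1m_raw.jsonl", "lmsys_chat_1m_clean.jsonl", detect_type, TypeCode.CORRECT)`.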
Related Pages
- Implements: Principle:Lm_sys_FastChat_Conversation_Content_Filtering
- Lm_sys_FastChat_Filter_Bad_Conv_Arena33k - Similar filtering for the Arena 33K dataset (without Chinese conversion)
- Lm_sys_FastChat_Clean_Chat_Data - Upstream data cleaning and deduplication
- Lm_sys_FastChat_Deduplication - High-frequency prompt deduplication for TOO_FREQUENT detection