Implementation:Lm sys FastChat Optional Clean
| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Optional Clean |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source Code Analysis, API Documentation |
| Domains | Data Preprocessing, NLP Pipeline, Language Detection |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Optional Clean is the implementation module for language-based filtering and repetitive pattern detection in the FastChat ShareGPT Data Pipeline. It uses the polyglot library with pycld2 for language detection and provides configurable options to keep only specific languages, skip specific languages, or filter out conversations containing repetitive digit patterns.
Description
This module provides optional cleaning filters that operate on already HTML-cleaned ShareGPT conversation data. The primary function, skip, evaluates each conversation against the configured language and repetition criteria and returns a boolean indicating whether the conversation should be excluded from the output.
The module operates as a CLI script that iterates over all conversations in the input JSON, applies the skip function to each, and writes only the non-skipped conversations to the output file. Output file names are auto-generated based on the filter configuration if not explicitly specified.
Usage
CLI Invocation
# Skip Korean conversations
python3 -m fastchat.data.optional_clean --in input.json --out output.json --skip-lang ko
# Keep only English conversations
python3 -m fastchat.data.optional_clean --in input.json --out output.json --keep-lang en
# Filter repetitive digit patterns
python3 -m fastchat.data.optional_clean --in input.json --out output.json --reduce-rep
# Combined: keep English and filter repetitive patterns
python3 -m fastchat.data.optional_clean --in input.json --out output.json --keep-lang en --reduce-rep
CLI Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
--in-file |
str | Yes | -- | Path to input JSON file (cleaned conversations) |
--out-file |
str | No | Auto-generated | Path to output JSON file; auto-generated from filters if omitted |
--keep-lang |
str | No | "all" |
Only keep conversations in this language; choices: "all", "en"
|
--skip-lang |
str | No | None | Skip conversations detected as this language code (e.g., "ko")
|
--reduce-rep |
flag | No | False | Filter conversations with 8+ consecutive identical digits |
Note: --keep-lang and --skip-lang are mutually exclusive. Either keep_lang must be "all" or skip_lang must be None.
Programmatic Import
from fastchat.data.optional_clean import skip
Code Reference
Source Location
| Item | Location |
|---|---|
| Module | fastchat/data/optional_clean.py
|
| skip function | Lines 21-44 |
| Main/CLI | Lines 47-91 |
| Repository | github.com/lm-sys/FastChat |
Function Signatures
def skip(conv, args) -> bool:
"""
Determine whether a conversation should be skipped (excluded).
Checks language detection (via polyglot.detect.Detector) against
keep_lang and skip_lang settings. If reduce_rep is enabled, also
checks for repetitive digit patterns (8+ consecutive identical digits).
Args:
conv: dict with "conversations" key containing list of
{"from": str, "value": str} dicts.
args: argparse.Namespace with keep_lang, skip_lang, reduce_rep fields.
Returns:
True if conversation should be skipped, False if it should be kept.
"""
Import
from fastchat.data.optional_clean import skip
Note: The skip function requires an args object with keep_lang, skip_lang, and reduce_rep attributes. When importing programmatically, you must construct a compatible namespace object.
I/O Contract
Inputs
| Input | Type | Description |
|---|---|---|
| in_file | JSON file | Cleaned ShareGPT JSON from the HTML cleaning step: a list of dicts, each with "id" and "conversations" (list of {"from": str, "value": str}). Values should already be in Markdown (not HTML).
|
Outputs
| Output | Type | Description |
|---|---|---|
| out_file | JSON file | Language-filtered JSON: same structure as input, but with conversations excluded based on language detection and repetition filters. |
Dependencies
| Package | Purpose |
|---|---|
| polyglot | Language detection framework |
| pycld2 | Compact Language Detector 2 backend for polyglot |
| pyicu | Unicode support required by polyglot |
| tqdm | Progress bar |
Install:
pip3 install polyglot pyicu pycld2
Usage Examples
Pipeline Usage (from prepare_all.py)
python3 -m fastchat.data.optional_clean \
--in ~/datasets/sharegpt_20230521_4k_clean.json \
--out ~/datasets/sharegpt_20230521_4k_clean_lang.json \
--skip-lang ko
Programmatic Usage
import argparse
import json
from fastchat.data.optional_clean import skip
# Build a compatible args namespace
args = argparse.Namespace(keep_lang="all", skip_lang="ko", reduce_rep=False)
content = json.load(open("sharegpt_clean.json", "r"))
filtered = [conv for conv in content if not skip(conv, args)]
print(f"#in: {len(content)}, #out: {len(filtered)}")
json.dump(filtered, open("sharegpt_clean_lang.json", "w"), indent=2, ensure_ascii=False)
Related Pages
- Principle:Lm_sys_FastChat_Language_Based_Filtering
- Principle:Lm_sys_FastChat_Language_Based_Filtering -- The principle that this implementation realizes
- Implementation:Lm_sys_FastChat_Clean_ShareGPT -- Previous pipeline step: HTML cleaning
- Implementation:Lm_sys_FastChat_Split_Long_Conversation -- Next pipeline step: conversation splitting