Implementation:Lm sys FastChat Optional Clean

Field	Value
Page Type	Implementation
Title	Optional Clean
Repository	lm-sys/FastChat
Knowledge Sources	Source Code Analysis, API Documentation
Domains	Data Preprocessing, NLP Pipeline, Language Detection
Last Updated	2026-02-07 14:00 GMT

Overview

Optional Clean is the implementation module for language-based filtering and repetitive pattern detection in the FastChat ShareGPT Data Pipeline. It uses the polyglot library with pycld2 for language detection and provides configurable options to keep only specific languages, skip specific languages, or filter out conversations containing repetitive digit patterns.

Description

This module provides optional cleaning filters that operate on already HTML-cleaned ShareGPT conversation data. The primary function, skip, evaluates each conversation against the configured language and repetition criteria and returns a boolean indicating whether the conversation should be excluded from the output.

The module operates as a CLI script that iterates over all conversations in the input JSON, applies the skip function to each, and writes only the non-skipped conversations to the output file. Output file names are auto-generated based on the filter configuration if not explicitly specified.

Usage

CLI Invocation

# Skip Korean conversations
python3 -m fastchat.data.optional_clean --in input.json --out output.json --skip-lang ko

# Keep only English conversations
python3 -m fastchat.data.optional_clean --in input.json --out output.json --keep-lang en

# Filter repetitive digit patterns
python3 -m fastchat.data.optional_clean --in input.json --out output.json --reduce-rep

# Combined: keep English and filter repetitive patterns
python3 -m fastchat.data.optional_clean --in input.json --out output.json --keep-lang en --reduce-rep

CLI Parameters

Parameter	Type	Required	Default	Description
`--in-file`	str	Yes	--	Path to input JSON file (cleaned conversations)
`--out-file`	str	No	Auto-generated	Path to output JSON file; auto-generated from filters if omitted
`--keep-lang`	str	No	`"all"`	Only keep conversations in this language; choices: `"all"`, `"en"`
`--skip-lang`	str	No	None	Skip conversations detected as this language code (e.g., `"ko"`)
`--reduce-rep`	flag	No	False	Filter conversations with 8+ consecutive identical digits

Note: --keep-lang and --skip-lang are mutually exclusive. Either keep_lang must be "all" or skip_lang must be None.

Programmatic Import

from fastchat.data.optional_clean import skip

Code Reference

Source Location

Item	Location
Module	`fastchat/data/optional_clean.py`
skip function	Lines 21-44
Main/CLI	Lines 47-91
Repository	github.com/lm-sys/FastChat

Function Signatures

def skip(conv, args) -> bool:
    """
    Determine whether a conversation should be skipped (excluded).

    Checks language detection (via polyglot.detect.Detector) against
    keep_lang and skip_lang settings. If reduce_rep is enabled, also
    checks for repetitive digit patterns (8+ consecutive identical digits).

    Args:
        conv: dict with "conversations" key containing list of
              {"from": str, "value": str} dicts.
        args: argparse.Namespace with keep_lang, skip_lang, reduce_rep fields.

    Returns:
        True if conversation should be skipped, False if it should be kept.
    """

Import

from fastchat.data.optional_clean import skip

Note: The skip function requires an args object with keep_lang, skip_lang, and reduce_rep attributes. When importing programmatically, you must construct a compatible namespace object.

I/O Contract

Inputs

Input	Type	Description
in_file	JSON file	Cleaned ShareGPT JSON from the HTML cleaning step: a list of dicts, each with `"id"` and `"conversations"` (list of `{"from": str, "value": str}`). Values should already be in Markdown (not HTML).

Outputs

Output	Type	Description
out_file	JSON file	Language-filtered JSON: same structure as input, but with conversations excluded based on language detection and repetition filters.

Dependencies

Package	Purpose
polyglot	Language detection framework
pycld2	Compact Language Detector 2 backend for polyglot
pyicu	Unicode support required by polyglot
tqdm	Progress bar

Install:

pip3 install polyglot pyicu pycld2

Usage Examples

Pipeline Usage (from prepare_all.py)

python3 -m fastchat.data.optional_clean \
    --in ~/datasets/sharegpt_20230521_4k_clean.json \
    --out ~/datasets/sharegpt_20230521_4k_clean_lang.json \
    --skip-lang ko

Programmatic Usage

import argparse
import json
from fastchat.data.optional_clean import skip

# Build a compatible args namespace
args = argparse.Namespace(keep_lang="all", skip_lang="ko", reduce_rep=False)

content = json.load(open("sharegpt_clean.json", "r"))
filtered = [conv for conv in content if not skip(conv, args)]

print(f"#in: {len(content)}, #out: {len(filtered)}")
json.dump(filtered, open("sharegpt_clean_lang.json", "w"), indent=2, ensure_ascii=False)

Related Pages

Principle:Lm_sys_FastChat_Language_Based_Filtering
Principle:Lm_sys_FastChat_Language_Based_Filtering -- The principle that this implementation realizes
Implementation:Lm_sys_FastChat_Clean_ShareGPT -- Previous pipeline step: HTML cleaning
Implementation:Lm_sys_FastChat_Split_Long_Conversation -- Next pipeline step: conversation splitting

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment