Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Implementation:Lm sys FastChat Optional Clean

From Leeroopedia


Field Value
Page Type Implementation
Title Optional Clean
Repository lm-sys/FastChat
Knowledge Sources Source Code Analysis, API Documentation
Domains Data Preprocessing, NLP Pipeline, Language Detection
Last Updated 2026-02-07 14:00 GMT

Overview

Optional Clean is the implementation module for language-based filtering and repetitive pattern detection in the FastChat ShareGPT Data Pipeline. It uses the polyglot library with pycld2 for language detection and provides configurable options to keep only specific languages, skip specific languages, or filter out conversations containing repetitive digit patterns.

Description

This module provides optional cleaning filters that operate on already HTML-cleaned ShareGPT conversation data. The primary function, skip, evaluates each conversation against the configured language and repetition criteria and returns a boolean indicating whether the conversation should be excluded from the output.

The module operates as a CLI script that iterates over all conversations in the input JSON, applies the skip function to each, and writes only the non-skipped conversations to the output file. Output file names are auto-generated based on the filter configuration if not explicitly specified.

Usage

CLI Invocation

# Skip Korean conversations
python3 -m fastchat.data.optional_clean --in input.json --out output.json --skip-lang ko

# Keep only English conversations
python3 -m fastchat.data.optional_clean --in input.json --out output.json --keep-lang en

# Filter repetitive digit patterns
python3 -m fastchat.data.optional_clean --in input.json --out output.json --reduce-rep

# Combined: keep English and filter repetitive patterns
python3 -m fastchat.data.optional_clean --in input.json --out output.json --keep-lang en --reduce-rep

CLI Parameters

Parameter Type Required Default Description
--in-file str Yes -- Path to input JSON file (cleaned conversations)
--out-file str No Auto-generated Path to output JSON file; auto-generated from filters if omitted
--keep-lang str No "all" Only keep conversations in this language; choices: "all", "en"
--skip-lang str No None Skip conversations detected as this language code (e.g., "ko")
--reduce-rep flag No False Filter conversations with 8+ consecutive identical digits

Note: --keep-lang and --skip-lang are mutually exclusive. Either keep_lang must be "all" or skip_lang must be None.

Programmatic Import

from fastchat.data.optional_clean import skip

Code Reference

Source Location

Item Location
Module fastchat/data/optional_clean.py
skip function Lines 21-44
Main/CLI Lines 47-91
Repository github.com/lm-sys/FastChat

Function Signatures

def skip(conv, args) -> bool:
    """
    Determine whether a conversation should be skipped (excluded).

    Checks language detection (via polyglot.detect.Detector) against
    keep_lang and skip_lang settings. If reduce_rep is enabled, also
    checks for repetitive digit patterns (8+ consecutive identical digits).

    Args:
        conv: dict with "conversations" key containing list of
              {"from": str, "value": str} dicts.
        args: argparse.Namespace with keep_lang, skip_lang, reduce_rep fields.

    Returns:
        True if conversation should be skipped, False if it should be kept.
    """

Import

from fastchat.data.optional_clean import skip

Note: The skip function requires an args object with keep_lang, skip_lang, and reduce_rep attributes. When importing programmatically, you must construct a compatible namespace object.

I/O Contract

Inputs

Input Type Description
in_file JSON file Cleaned ShareGPT JSON from the HTML cleaning step: a list of dicts, each with "id" and "conversations" (list of {"from": str, "value": str}). Values should already be in Markdown (not HTML).

Outputs

Output Type Description
out_file JSON file Language-filtered JSON: same structure as input, but with conversations excluded based on language detection and repetition filters.

Dependencies

Package Purpose
polyglot Language detection framework
pycld2 Compact Language Detector 2 backend for polyglot
pyicu Unicode support required by polyglot
tqdm Progress bar

Install:

pip3 install polyglot pyicu pycld2

Usage Examples

Pipeline Usage (from prepare_all.py)

python3 -m fastchat.data.optional_clean \
    --in ~/datasets/sharegpt_20230521_4k_clean.json \
    --out ~/datasets/sharegpt_20230521_4k_clean_lang.json \
    --skip-lang ko

Programmatic Usage

import argparse
import json
from fastchat.data.optional_clean import skip

# Build a compatible args namespace
args = argparse.Namespace(keep_lang="all", skip_lang="ko", reduce_rep=False)

content = json.load(open("sharegpt_clean.json", "r"))
filtered = [conv for conv in content if not skip(conv, args)]

print(f"#in: {len(content)}, #out: {len(filtered)}")
json.dump(filtered, open("sharegpt_clean_lang.json", "w"), indent=2, ensure_ascii=False)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment