
Principle: lm-sys/FastChat Language Based Filtering

From Leeroopedia


Field Value
Page Type Principle
Title Language Based Filtering
Repository lm-sys/FastChat
Knowledge Sources Source Code Analysis, API Documentation
Domains Data Preprocessing, NLP Pipeline, Language Detection
Last Updated 2026-02-07 14:00 GMT

Overview

Language Based Filtering is a data quality principle in the FastChat ShareGPT Data Pipeline that governs the selection or exclusion of conversations based on their detected natural language. By filtering conversations to specific languages or removing unwanted ones, the pipeline ensures that training data is linguistically homogeneous and appropriate for the target model's intended language capabilities.

Description

Language model fine-tuning benefits from controlled language composition in the training set. The Language Based Filtering principle addresses this through two complementary strategies and an additional data quality check:

Language Detection

Each conversation's text content is concatenated and analyzed using the polyglot library (backed by pycld2, the Compact Language Detector 2). The detector returns a language code (e.g., "en" for English, "ko" for Korean) that is used for filtering decisions. When detection fails -- due to very short text, mixed-language content, or encoding issues -- the language is labeled as "unknown".
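A small wrapper consistent with this behavior is sketched below. The function name and the `detector_cls` injection point are illustrative, not FastChat's exact code (which lives in fastchat/data/optional_clean.py); by default the wrapper would use polyglot's Detector:

```python
def detect_language(text, detector_cls=None):
    """Return a language code like "en" or "ko", or "unknown" on failure.

    Illustrative sketch: by default this uses polyglot's Detector
    (backed by pycld2); any detector object with the same shape works.
    """
    if detector_cls is None:
        from polyglot.detect import Detector  # requires polyglot + pycld2
        detector_cls = Detector
    try:
        # polyglot raises on very short or otherwise un-detectable text,
        # which is exactly the case mapped to "unknown" here.
        return detector_cls(text).language.code
    except Exception:
        return "unknown"
```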

Keep vs. Skip Strategies

The principle supports two mutually exclusive filtering strategies:

  • Keep Language (--keep-lang): Only retain conversations detected as a specific language (e.g., "en" for English). All other languages are discarded. The special value "all" disables this filter.
  • Skip Language (--skip-lang): Remove conversations in a specific language (e.g., "ko" for Korean) while keeping everything else. This is useful for excluding one problematic language without restricting to a single target.

These strategies cannot be combined simultaneously -- the pipeline enforces that either keep_lang is "all" or skip_lang is None.
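Assuming a detected language code is already available, the resulting decision logic can be sketched as follows (the function name is ours, not FastChat's):

```python
def should_keep(lang, keep_lang="all", skip_lang=None):
    """Decide whether a conversation in language `lang` survives filtering.

    Enforces the pipeline's rule that the two strategies are mutually
    exclusive: either keep_lang is "all" or skip_lang is None.
    """
    assert keep_lang == "all" or skip_lang is None, (
        "keep-lang and skip-lang cannot be combined"
    )
    if keep_lang != "all" and lang != keep_lang:
        return False  # keep strategy: not the target language
    if skip_lang is not None and lang == skip_lang:
        return False  # skip strategy: excluded language
    return True
```

Note the asymmetry this implies for "unknown" conversations: a keep strategy drops them (they do not match the target language), while a skip strategy retains them.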

Repetitive Pattern Detection

Beyond language filtering, this principle also encompasses the detection of repetitive digit patterns. Conversations containing a digit repeated nine or more times in a row (matched by the regex (\d)\1{8}, i.e., one digit followed by eight identical repeats) are flagged as potentially low-quality or corrupted data. This heuristic catches cases where model outputs degenerate into repetitive number sequences. However, this filter is applied cautiously, as legitimate data (such as addresses or phone numbers) may contain long digit sequences.
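The check itself can be reproduced with Python's re module (the constant and function names below are ours):

```python
import re

# One digit captured, then eight identical repeats: nine in a row total.
REPETITIVE_DIGITS = re.compile(r"(\d)\1{8}")

def has_repetitive_digits(text):
    """Flag text containing a run of nine or more identical digits."""
    return REPETITIVE_DIGITS.search(text) is not None
```

An eight-digit run such as "88888888" passes the filter, while a nine-digit run like "999999999" is flagged.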

Data Quality Through Language Homogeneity

Maintaining language homogeneity in training data is important for several reasons:

  • Models fine-tuned on mixed-language data may produce code-switched outputs
  • Tokenizer efficiency varies by language, and mixed data can lead to suboptimal token utilization
  • Evaluation metrics are easier to interpret when the training and test sets share the same language distribution

Usage

In the standard FastChat pipeline (as defined in prepare_all.py), language filtering is the second step, applied after HTML cleaning:

python3 -m fastchat.data.optional_clean \
    --in sharegpt_clean.json \
    --out sharegpt_clean_lang.json \
    --skip-lang ko

The default pipeline skips Korean conversations specifically, a decision based on observed data quality issues in the ShareGPT Korean subset.
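A minimal sketch of such a filtering pass over a ShareGPT-style JSON file is shown below. The function name is ours and the detector is injected to keep the sketch library-agnostic; the "conversations"/"value" field names follow the ShareGPT format described above:

```python
import json

def filter_file(in_path, out_path, detect, keep_lang="all", skip_lang=None):
    """Drop conversations whose detected language fails the keep/skip rule."""
    assert keep_lang == "all" or skip_lang is None  # strategies are exclusive
    with open(in_path, encoding="utf-8") as f:
        samples = json.load(f)
    kept = []
    for sample in samples:
        # Concatenate all turns, then detect the dominant language.
        text = "\n".join(turn["value"] for turn in sample["conversations"])
        lang = detect(text)
        if keep_lang != "all" and lang != keep_lang:
            continue
        if skip_lang is not None and lang == skip_lang:
            continue
        kept.append(sample)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(kept, f, indent=2, ensure_ascii=False)
    return len(kept)
```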

Theoretical Basis

Language-based data filtering draws from several well-established principles in NLP and machine learning:

  • Domain adaptation: Fine-tuning is most effective when the training distribution matches the target distribution. If the model is intended for English conversations, non-English training data adds noise.
  • Curriculum learning: Controlling the language composition allows practitioners to design training curricula that gradually introduce multilingual capabilities.
  • Data quality heuristics: The repetitive digit pattern detector is a lightweight heuristic for identifying degenerate model outputs that were captured in the ShareGPT dataset. Such outputs provide no useful learning signal and can reinforce repetitive behavior in the fine-tuned model.
  • Language identification confidence: Using pycld2 provides reliable language detection for most major languages, though short texts or highly technical content (e.g., code-heavy conversations) may produce unreliable results. The "unknown" fallback handles these gracefully.
