# Principle:Lm_sys_FastChat_Language_Based_Filtering
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Language Based Filtering |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source Code Analysis, API Documentation |
| Domains | Data Preprocessing, NLP Pipeline, Language Detection |
| Last Updated | 2026-02-07 14:00 GMT |
## Overview
Language Based Filtering is a data quality principle in the FastChat ShareGPT Data Pipeline that governs the selection or exclusion of conversations based on their detected natural language. By filtering conversations to specific languages or removing unwanted ones, the pipeline ensures that training data is linguistically homogeneous and appropriate for the target model's intended language capabilities.
## Description
Language model fine-tuning benefits from controlled language composition in the training set. The Language Based Filtering principle addresses this through two complementary strategies and an additional data quality check:
### Language Detection
Each conversation's text content is concatenated and analyzed using the polyglot library (backed by pycld2, the Compact Language Detector 2). The detector returns a language code (e.g., "en" for English, "ko" for Korean) that is used for filtering decisions. When detection fails -- due to very short text, mixed-language content, or encoding issues -- the language is labeled as "unknown".
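This step can be sketched as a small helper (the function name is illustrative, not FastChat's actual API; it assumes the polyglot package, backed by pycld2, is installed):

```python
def detect_language(text):
    """Return a detected language code, or "unknown" on failure.

    Illustrative sketch: polyglot's Detector raises on very short or
    undetectable input, and the import fails if polyglot/pycld2 is not
    installed -- both cases fall back to "unknown".
    """
    try:
        from polyglot.detect import Detector  # backed by pycld2 (CLD2)
        return Detector(text).language.code
    except Exception:
        # Import failure, short text, mixed languages, or encoding issues.
        return "unknown"
```

An empty or undetectable string yields "unknown", while a sufficiently long English passage is typically detected as "en".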
### Keep vs. Skip Strategies
The principle supports two mutually exclusive filtering strategies:
- Keep Language (--keep-lang): Only retain conversations detected as a specific language (e.g., "en" for English). All other languages are discarded. The special value "all" disables this filter.
- Skip Language (--skip-lang): Remove conversations in a specific language (e.g., "ko" for Korean) while keeping everything else. This is useful for excluding one problematic language without restricting to a single target.
These strategies cannot be combined simultaneously -- the pipeline enforces that either keep_lang is "all" or skip_lang is None.
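The combined keep/skip decision can be sketched as a small predicate (a hypothetical helper; the real logic lives in fastchat/data/optional_clean.py):

```python
def should_keep(lang, keep_lang="all", skip_lang=None):
    """Decide whether a conversation detected as `lang` survives filtering."""
    # The pipeline enforces that the two strategies are never combined.
    assert keep_lang == "all" or skip_lang is None
    if keep_lang != "all" and lang != keep_lang:
        return False  # keep-lang mode: drop every other language
    if skip_lang is not None and lang == skip_lang:
        return False  # skip-lang mode: drop the one unwanted language
    return True
```

For example, should_keep("ko", skip_lang="ko") returns False, while should_keep("en", keep_lang="en") returns True.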
### Repetitive Pattern Detection
Beyond language filtering, this principle also encompasses the detection of repetitive digit patterns. Conversations containing runs of nine or more consecutive identical digits -- one digit followed by eight repetitions, matched by the regex (\d)\1{8} -- are flagged as potentially low-quality or corrupted data. This heuristic catches cases where model outputs degenerate into repetitive number sequences. However, this filter is applied cautiously, as legitimate data (such as addresses or phone numbers) may contain long digit sequences.
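The heuristic reduces to a one-line regex check. Note that the pattern only fires at nine identical digits in a row (one capture plus eight backreferences), so ordinary phone numbers with varied digits are not flagged; the helper name below is illustrative:

```python
import re

# A digit followed by eight copies of itself, i.e. nine identical digits.
REPEATED_DIGITS = re.compile(r"(\d)\1{8}")

def looks_degenerate(text):
    """Flag text containing 9+ consecutive identical digits."""
    return REPEATED_DIGITS.search(text) is not None
```

For instance, looks_degenerate("id: 999999999") is True, while a ten-digit string of distinct digits such as "0123456789" is not flagged.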
### Data Quality Through Language Homogeneity
Maintaining language homogeneity in training data is important for several reasons:
- Models fine-tuned on mixed-language data may produce code-switched outputs
- Tokenizer efficiency varies by language, and mixed data can lead to suboptimal token utilization
- Evaluation metrics are easier to interpret when the training and test sets share the same language distribution
## Usage
In the standard FastChat pipeline (as defined in prepare_all.py), language filtering is the second step, applied after HTML cleaning:
```shell
python3 -m fastchat.data.optional_clean \
    --in sharegpt_clean.json \
    --out sharegpt_clean_lang.json \
    --skip-lang ko
```
The default pipeline skips Korean conversations specifically, a decision based on observed data quality issues in the ShareGPT Korean subset.
## Theoretical Basis
Language-based data filtering draws from several well-established principles in NLP and machine learning:
- Domain adaptation: Fine-tuning is most effective when the training distribution matches the target distribution. If the model is intended for English conversations, non-English training data adds noise.
- Curriculum learning: Controlling the language composition allows practitioners to design training curricula that gradually introduce multilingual capabilities.
- Data quality heuristics: The repetitive digit pattern detector is a lightweight heuristic for identifying degenerate model outputs that were captured in the ShareGPT dataset. Such outputs provide no useful learning signal and can reinforce repetitive behavior in the fine-tuned model.
- Language identification confidence: Using pycld2 provides reliable language detection for most major languages, though short texts or highly technical content (e.g., code-heavy conversations) may produce unreliable results. The "unknown" fallback handles these gracefully.
## Related Pages
- Implementation:Lm_sys_FastChat_Optional_Clean -- The implementation that realizes this principle
- Principle:Lm_sys_FastChat_ShareGPT_HTML_Cleaning -- Previous pipeline stage: HTML cleaning
- Principle:Lm_sys_FastChat_Long_Conversation_Splitting -- Next pipeline stage: conversation splitting