Principle:Huggingface Datatrove Language Filtering
| Property | Value |
|---|---|
| Principle Name | Language_Filtering |
| Overview | Identifying and filtering documents by their natural language using statistical language identification models |
| Domains | Language_Identification, NLP, Data_Filtering |
| Related Implementation | Huggingface_Datatrove_LanguageFilter |
| Knowledge Sources | Huggingface_Datatrove, FastText_LID |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Language filtering identifies the natural language of each document in a data pipeline and removes documents that do not match specified target languages or that fall below a confidence threshold. This is a critical step in building monolingual or controlled-multilingual text corpora from web crawl data, where pages in hundreds of languages are intermixed.
Description
Language filtering uses pre-trained FastText models to predict the language of each document from character n-gram features. Datatrove supports two language identification backends:
- ft176 -- A FastText model trained on Wikipedia and Tatoeba data covering 176 languages. This is the default backend.
- GlotLID -- An alternative language identification model with broader language coverage, which additionally provides script identification (e.g., distinguishing Latin-script vs. Cyrillic-script text).
For each document, the model produces a predicted language label and a confidence score between 0 and 1. The filter applies two criteria:
- Language match -- If a list of target languages is specified, the document is kept only if one of the target languages has a confidence score above the threshold.
- Confidence threshold -- If no target language list is specified (accepting all languages), the document is kept only if the top-predicted language meets the threshold (default: 0.65).
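The two criteria above can be sketched as a single decision function. This is an illustrative re-implementation, not Datatrove's actual code; the `scores` dict (predicted language label to confidence) stands in for the FastText model output.

```python
def keep_document(scores, languages=None, threshold=0.65):
    """Return True if the document passes language filtering.

    scores: mapping of predicted language label to confidence in [0, 1].
    languages: target languages to keep, or None to accept any language
               whose top prediction clears the threshold.
    """
    if languages is not None:
        # Language match: keep if any target language clears the threshold.
        return any(scores.get(lang, 0.0) >= threshold for lang in languages)
    # No target list: keep if the top-predicted language clears the threshold.
    return max(scores.values(), default=0.0) >= threshold

print(keep_document({"en": 0.98}))                    # True: confident top-1
print(keep_document({"en": 0.40, "de": 0.35}))        # False: uncertain
print(keep_document({"en": 0.98}, languages=["fr"]))  # False: wrong language
```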
The filter supports two operational modes:
- Filter mode (default) -- Documents that fail the criteria are removed from the pipeline.
- Label-only mode -- All documents are kept, but language metadata (language, language_score) is annotated onto each document. This is useful for downstream analysis or conditional processing.
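Label-only mode can be sketched as follows. The `Document` dataclass and `label_language` function here are stand-ins for illustration, not Datatrove's own classes; the metadata keys match those named above.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

def label_language(doc, scores):
    """Annotate the top-predicted language and its score; never drop."""
    lang, score = max(scores.items(), key=lambda kv: kv[1])
    doc.metadata["language"] = lang
    doc.metadata["language_score"] = score
    return True  # label-only mode keeps every document

doc = Document(text="Bonjour le monde")
label_language(doc, {"fr": 0.92, "en": 0.05})
print(doc.metadata)  # {'language': 'fr', 'language_score': 0.92}
```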
Usage
Language filtering is applied after text extraction to restrict the pipeline to specific target languages. In a typical pipeline:
- Read raw documents
- URL filtering
- HTML text extraction
- Language filtering (this principle)
- Quality and content filters
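The ordering above can be sketched as a chain of predicates, where a document survives only if every stage accepts it. All stage names here are illustrative stand-ins; in Datatrove, real pipeline blocks are composed instead.

```python
def run_pipeline(docs, stages):
    """Apply each filter stage in order; keep docs that pass all of them."""
    for stage in stages:
        docs = [d for d in docs if stage(d)]
    return docs

# Stand-in stages (real ones do URL checks, HTML extraction, LID, etc.).
stages = [
    lambda d: not d["url"].endswith(".xml"),  # URL filtering
    lambda d: len(d["text"]) > 0,             # extraction produced text
    lambda d: d["lang_score"] >= 0.65,        # language filtering
]

docs = [
    {"url": "https://a.example/page", "text": "hello", "lang_score": 0.9},
    {"url": "https://b.example/feed.xml", "text": "hi", "lang_score": 0.9},
    {"url": "https://c.example/page", "text": "??", "lang_score": 0.3},
]
print(len(run_pipeline(docs, stages)))  # 1: only the first doc passes all stages
```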
Theoretical Basis
- FastText language identification -- FastText models represent text using character n-grams (typically 2-grams through 5-grams), which capture subword patterns that are highly discriminative for language identification. These features are particularly effective because they capture morphological patterns and character distributions unique to each language, without requiring tokenization or word boundaries.
- Confidence thresholding -- The softmax output of the FastText classifier provides a confidence score. Setting a threshold (default: 0.65) rejects documents where the model is uncertain, which often indicates mixed-language content, very short text, or content that is not natural language (e.g., code, tables, or lists of numbers).
- Multi-label scoring -- Rather than relying solely on the top-1 prediction, the filter can examine scores for all target languages. A document written primarily in Spanish but with some Portuguese sections may have high scores for both languages; if either target language exceeds the threshold, the document is kept.
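A worked contrast of the two policies, with illustrative scores for the mixed Spanish/Portuguese case described above: a top-1-only policy rejects the page because the single best guess is not a target language, while multi-label scoring checks the target language's score directly.

```python
scores = {"es": 0.50, "pt": 0.45}   # mixed Spanish/Portuguese page
targets, threshold = ["pt"], 0.40   # we only want Portuguese

top1 = max(scores, key=scores.get)  # "es"
top1_keep = top1 in targets and scores[top1] >= threshold
multi_keep = any(scores.get(lang, 0.0) >= threshold for lang in targets)
print(top1_keep, multi_keep)  # False True
```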
- Character n-gram features -- Unlike word-level features, character n-grams are language-agnostic in their construction (no tokenizer required) and naturally handle languages without whitespace word boundaries (e.g., Chinese, Japanese, Thai). The ft176 model covers 176 languages using this approach.
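A minimal sketch of character n-gram extraction, the feature type described above. FastText computes these features internally; this function just shows why no tokenizer or whitespace word boundaries are needed, including for Japanese text.

```python
def char_ngrams(text, n_min=2, n_max=5):
    """All character n-grams of length n_min..n_max, in order."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

print(char_ngrams("こんにちは", 2, 3))
# ['こん', 'んに', 'にち', 'ちは', 'こんに', 'んにち', 'にちは']
```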
- GlotLID script detection -- The GlotLID backend additionally identifies the script (writing system) of the document. This is valuable for languages written in multiple scripts (e.g., Serbian in both Cyrillic and Latin script, or Hindi in both Devanagari and Latin transliteration).
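GlotLID labels combine an ISO 639-3 language code with an ISO 15924 script code (e.g. srp_Cyrl vs. srp_Latn for the two Serbian scripts), behind FastText's conventional "__label__" prefix; the exact label format shown here is an assumption. Splitting such a label into its two parts is straightforward:

```python
def split_lid_label(label):
    """Split a GlotLID-style label into (language, script).

    Assumes the "language_Script" format with an optional fastText
    "__label__" prefix.
    """
    lang, script = label.removeprefix("__label__").split("_")
    return lang, script

print(split_lid_label("__label__srp_Cyrl"))  # ('srp', 'Cyrl')
```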
Related Pages
- Huggingface_Datatrove_LanguageFilter (implements this principle) -- Concrete filter class for language identification and filtering
- Huggingface_Datatrove_HTML_Text_Extraction (upstream step) -- Text extraction that produces the plain text input for language identification
- Huggingface_Datatrove_URL_Filtering (upstream step) -- URL-based filtering applied before language filtering
- Huggingface_Datatrove_Random_Sampling (downstream step) -- Random sampling that may follow language filtering