Principle:Huggingface Datatrove Language Filtering
| Property | Value |
|---|---|
| Principle Name | Language_Filtering |
| Overview | Identifying and filtering documents by their natural language using statistical language identification models |
| Domains | Language_Identification, NLP, Data_Filtering |
| Related Implementation | Huggingface_Datatrove_LanguageFilter |
| Knowledge Sources | Huggingface_Datatrove, FastText_LID |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Language filtering identifies the natural language of each document in a data pipeline and removes documents that do not match specified target languages or that fall below a confidence threshold. This is a critical step in building monolingual or controlled-multilingual text corpora from web crawl data, where pages in hundreds of languages are intermixed.
Description
Language filtering uses pre-trained FastText models to predict the language of each document from character n-gram features. Datatrove supports two language identification backends:
- ft176 -- A FastText model trained on Wikipedia and Tatoeba data covering 176 languages. This is the default backend.
- GlotLID -- An alternative language identification model with broader language coverage, which additionally provides script identification (e.g., distinguishing Latin-script vs. Cyrillic-script text).
For each document, the model produces a predicted language label and a confidence score between 0 and 1. The filter applies two criteria:
- Language match -- If a list of target languages is specified, the document is kept only if one of the target languages has a confidence score above the threshold.
- Confidence threshold -- If no target language list is specified (accepting all languages), the document is kept only if the top-predicted language meets the threshold (default: 0.65).
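The two criteria above can be sketched as a single decision function. This is an illustrative re-implementation, not Datatrove's actual code; the `scores` dict (predicted language label to confidence) stands in for the FastText model output.

```python
def keep_document(scores, languages=None, threshold=0.65):
    """Return True if the document passes language filtering.

    scores: mapping of predicted language label to confidence in [0, 1].
    languages: target languages to keep, or None to accept any language
               whose top prediction clears the threshold.
    """
    if languages is not None:
        # Language match: keep if any target language clears the threshold.
        return any(scores.get(lang, 0.0) >= threshold for lang in languages)
    # No target list: keep if the top-predicted language clears the threshold.
    return max(scores.values(), default=0.0) >= threshold

print(keep_document({"en": 0.98}))                    # True: confident top-1
print(keep_document({"en": 0.40, "de": 0.35}))        # False: uncertain
print(keep_document({"en": 0.98}, languages=["fr"]))  # False: wrong language
```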
The filter supports two operational modes:
- Filter mode (default) -- Documents that fail the criteria are removed from the pipeline.
- Label-only mode -- All documents are kept, but language metadata (language, language_score) is annotated onto each document. This is useful for downstream analysis or conditional processing.
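Label-only mode can be sketched as follows. The `Document` dataclass and `label_language` function here are stand-ins for illustration, not Datatrove's own classes; the metadata keys match those named above.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

def label_language(doc, scores):
    """Annotate the top-predicted language and its score; never drop."""
    lang, score = max(scores.items(), key=lambda kv: kv[1])
    doc.metadata["language"] = lang
    doc.metadata["language_score"] = score
    return True  # label-only mode keeps every document

doc = Document(text="Bonjour le monde")
label_language(doc, {"fr": 0.92, "en": 0.05})
print(doc.metadata)  # {'language': 'fr', 'language_score': 0.92}
```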
Usage
Language filtering is applied after text extraction to restrict the pipeline to specific target languages. In a typical pipeline:
- Read raw documents
- URL filtering
- HTML text extraction
- Language filtering (this principle)
- Quality and content filters
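The ordering above can be sketched as a chain of predicates, where a document survives only if every stage accepts it. All stage names here are illustrative stand-ins; in Datatrove, real pipeline blocks are composed instead.

```python
def run_pipeline(docs, stages):
    """Apply each filter stage in order; keep docs that pass all of them."""
    for stage in stages:
        docs = [d for d in docs if stage(d)]
    return docs

# Stand-in stages (real ones do URL checks, HTML extraction, LID, etc.).
stages = [
    lambda d: not d["url"].endswith(".xml"),  # URL filtering
    lambda d: len(d["text"]) > 0,             # extraction produced text
    lambda d: d["lang_score"] >= 0.65,        # language filtering
]

docs = [
    {"url": "https://a.example/page", "text": "hello", "lang_score": 0.9},
    {"url": "https://b.example/feed.xml", "text": "hi", "lang_score": 0.9},
    {"url": "https://c.example/page", "text": "??", "lang_score": 0.3},
]
print(len(run_pipeline(docs, stages)))  # 1: only the first doc passes all stages
```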
Theoretical Basis
- FastText language identification -- FastText models represent text using character n-grams (typically 2-grams through 5-grams), which capture subword patterns that are highly discriminative for language identification. These features are particularly effective because they capture morphological patterns and character distributions unique to each language, without requiring tokenization or word boundaries.
- Confidence thresholding -- The softmax output of the FastText classifier provides a confidence score. Setting a threshold (default: 0.65) rejects documents where the model is uncertain, which often indicates mixed-language content, very short text, or content that is not natural language (e.g., code, tables, or lists of numbers).
- Multi-label scoring -- Rather than relying solely on the top-1 prediction, the filter can examine scores for all target languages. A document written primarily in Spanish but with some Portuguese sections may have high scores for both languages; if either target language exceeds the threshold, the document is kept.
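A worked contrast of the two policies, with illustrative scores for the mixed Spanish/Portuguese case described above: a top-1-only policy rejects the page because the single best guess is not a target language, while multi-label scoring checks the target language's score directly.

```python
scores = {"es": 0.50, "pt": 0.45}   # mixed Spanish/Portuguese page
targets, threshold = ["pt"], 0.40   # we only want Portuguese

top1 = max(scores, key=scores.get)  # "es"
top1_keep = top1 in targets and scores[top1] >= threshold
multi_keep = any(scores.get(lang, 0.0) >= threshold for lang in targets)
print(top1_keep, multi_keep)  # False True
```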
- Character n-gram features -- Unlike word-level features, character n-grams are language-agnostic in their construction (no tokenizer required) and naturally handle languages without whitespace word boundaries (e.g., Chinese, Japanese, Thai). The ft176 model covers 176 languages using this approach.
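A minimal sketch of character n-gram extraction, the feature type described above. FastText computes these features internally; this function just shows why no tokenizer or whitespace word boundaries are needed, including for Japanese text.

```python
def char_ngrams(text, n_min=2, n_max=5):
    """All character n-grams of length n_min..n_max, in order."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

print(char_ngrams("こんにちは", 2, 3))
# ['こん', 'んに', 'にち', 'ちは', 'こんに', 'んにち', 'にちは']
```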
- GlotLID script detection -- The GlotLID backend additionally identifies the script (writing system) of the document. This is valuable for languages written in multiple scripts (e.g., Serbian in both Cyrillic and Latin script, or Hindi in both Devanagari and Latin transliteration).
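GlotLID labels combine an ISO 639-3 language code with an ISO 15924 script code (e.g. srp_Cyrl vs. srp_Latn for the two Serbian scripts), behind FastText's conventional "__label__" prefix; the exact label format shown here is an assumption. Splitting such a label into its two parts is straightforward:

```python
def split_lid_label(label):
    """Split a GlotLID-style label into (language, script).

    Assumes the "language_Script" format with an optional fastText
    "__label__" prefix.
    """
    lang, script = label.removeprefix("__label__").split("_")
    return lang, script

print(split_lid_label("__label__srp_Cyrl"))  # ('srp', 'Cyrl')
```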
Related Pages
- Huggingface_Datatrove_LanguageFilter (implements this principle) -- Concrete filter class for language identification and filtering
- Huggingface_Datatrove_HTML_Text_Extraction (upstream step) -- Text extraction that produces the plain text input for language identification
- Huggingface_Datatrove_URL_Filtering (upstream step) -- URL-based filtering applied before language filtering
- Huggingface_Datatrove_Random_Sampling (downstream step) -- Random sampling that may follow language filtering