Implementation: Huggingface Datatrove LanguageFilter
| Knowledge Sources | |
|---|---|
| Domains | Language_Identification, NLP, Data_Filtering |
| Type | Filter Module |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete filter class that identifies the natural language of each document using a FastText-based language identification model and removes documents that do not match specified target languages or fall below a confidence threshold.
Description
The LanguageFilter class extends BaseFilter and wraps two FastText-based language identification backends: FT176LID (176-language model) and GlotLID (broader coverage with script detection). On each call to filter(), the model predicts language scores for the document text, annotates the document metadata with language and language_score, and returns whether the document should be kept.
Operational modes:
- Filter mode (default, label_only=False) -- Documents are rejected if no target language exceeds the threshold (when languages are specified) or if the top language score is below the threshold (when languages is None).
- Label-only mode (label_only=True) -- All documents are kept, but language metadata is annotated. Useful for analysis pipelines.
GlotLID-specific behavior: When using the glotlid backend, the predicted label has the format lang_script (e.g., eng_Latn). The filter splits this into separate language and language_script metadata fields.
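The label split can be sketched as follows; `split_glotlid_label` is a hypothetical helper mirroring what the filter does internally for the glotlid backend.

```python
# Hypothetical helper mirroring the split described above; the real
# filter performs this internally for the glotlid backend.
def split_glotlid_label(label: str) -> tuple[str, str]:
    # GlotLID labels have the form "<lang>_<script>", e.g. "eng_Latn"
    language, script = label.rsplit("_", 1)
    return language, script
```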
Top-pairs tracking: When keep_top_pairs_threshold is set to a non-negative value, all language predictions with scores above that threshold are stored in metadata as top_language_{lang}_score keys.
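The resulting metadata keys can be sketched like this (a hypothetical illustration of the annotation scheme described above; `top_pair_metadata` is not a datatrove function):

```python
# Hypothetical sketch of the top-pairs annotation described above;
# `top_pair_metadata` is an illustrative name, not datatrove API.
def top_pair_metadata(predictions: dict[str, float],
                      keep_top_pairs_threshold: float) -> dict[str, float]:
    if keep_top_pairs_threshold < 0:
        return {}  # -1 (the default) disables tracking
    return {
        f"top_language_{lang}_score": score
        for lang, score in predictions.items()
        if score > keep_top_pairs_threshold
    }
```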
Usage
Use LanguageFilter after text extraction to restrict a datatrove pipeline to specific target languages, or in label-only mode to annotate documents with language metadata for downstream analysis.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/filters/language_filter.py
- Lines: 9-65
Signature
class LanguageFilter(BaseFilter):
    name = "Language ID"
    _requires_dependencies = [("fasttext", "fasttext-numpy2-wheel"), "fasteners"]

    def __init__(
        self,
        languages: list[str] | str | None = None,
        language_threshold: float = 0.65,
        exclusion_writer: DiskWriter = None,
        backend: Literal["ft176", "glotlid"] = "ft176",
        label_only: bool = False,
        keep_top_pairs_threshold: float = -1,
    ):
        ...

    def filter(self, doc: Document) -> bool:
        ...
Import
from datatrove.pipeline.filters import LanguageFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| languages | list[str] \| str \| None | No (default: None) | List of language codes to keep (e.g., ["en", "fr"]). A single string is auto-wrapped in a list. None accepts all languages above the threshold. |
| language_threshold | float | No (default: 0.65) | Minimum confidence score to accept a document |
| exclusion_writer | DiskWriter | No (default: None) | Optional writer to save rejected documents |
| backend | Literal["ft176", "glotlid"] | No (default: "ft176") | Language identification model backend |
| label_only | bool | No (default: False) | If True, annotate language metadata without removing any documents |
| keep_top_pairs_threshold | float | No (default: -1) | Store all language predictions above this score in metadata. Set to -1 to disable. |
Pipeline Input: A Document object with plain text in its .text field.
Outputs
| Name | Type | Description |
|---|---|---|
| return value | bool | True if the document passes language criteria and should be kept |
Metadata annotations added to each document:
| Key | Type | Description |
|---|---|---|
| language | str | Predicted language code (e.g., "en", "fr") |
| language_score | float | Confidence score for the predicted language |
| language_script | str | Script identifier (only when backend="glotlid", e.g., "Latn") |
| top_language_{lang}_score | float | Score for each language above keep_top_pairs_threshold (when enabled) |
Usage Examples
Filter to English Only
from datatrove.pipeline.filters import LanguageFilter
lang_filter = LanguageFilter(
    languages=["en"],
    language_threshold=0.65,
)
Multi-Language Filter with GlotLID
from datatrove.pipeline.filters import LanguageFilter
lang_filter = LanguageFilter(
    languages=["en", "fr", "de", "es"],
    language_threshold=0.5,
    backend="glotlid",
)
Label-Only Mode for Analysis
from datatrove.pipeline.filters import LanguageFilter
# Annotate all documents with language info without removing any
lang_annotator = LanguageFilter(
    label_only=True,
    keep_top_pairs_threshold=0.1,
)
Full Pipeline Example
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import LanguageFilter, URLFilter
from datatrove.pipeline.readers import WarcReader
pipeline = LocalPipelineExecutor(
    pipeline=[
        WarcReader("s3://commoncrawl/crawl-data/CC-MAIN-2024-10/"),
        Trafilatura(favour_precision=True, timeout=1),
        URLFilter(),
        LanguageFilter(languages=["en"], language_threshold=0.65),
    ],
    tasks=100,
)
pipeline.run()
Related Pages
- Huggingface_Datatrove_Language_Filtering (principle) -- The principle this implementation realizes
- Huggingface_Datatrove_Trafilatura (upstream step) -- HTML text extraction that produces the plain text input
- Huggingface_Datatrove_URLFilter (upstream filter) -- URL-based filtering applied before language filtering
- Huggingface_Datatrove_SamplerFilter (downstream filter) -- Random sampling that may follow language filtering