Implementation:Huggingface Datatrove LanguageFilter

From Leeroopedia
Knowledge Sources
Domains: Language_Identification, NLP, Data_Filtering
Type: Filter Module
Last Updated: 2026-02-14 00:00 GMT

Overview

Concrete filter class that identifies the natural language of each document using a FastText-based language identification model and removes documents that do not match specified target languages or fall below a confidence threshold.

Description

The LanguageFilter class extends BaseFilter and wraps one of two FastText-based language identification backends, selected with the backend parameter: FT176LID ("ft176", a 176-language model) and GlotLID ("glotlid", broader coverage with script detection). On each call to filter(), the model predicts language scores for the document text, the filter annotates the document metadata with language and language_score, and filter() returns whether the document should be kept.

Operational modes:

  • Filter mode (default, label_only=False) -- When languages is specified, a document is rejected if none of the target languages scores above language_threshold. When languages is None, it is rejected if the top language score is below language_threshold.
  • Label-only mode (label_only=True) -- All documents are kept, but language metadata is annotated. Useful for analysis pipelines.
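The two modes can be summarised as a small decision function. This is a minimal sketch written against the behaviour described above, not datatrove's actual code: should_keep and the scores mapping (language label to confidence) are hypothetical names introduced for illustration, and the strict ">" comparison is an assumption.

```python
from typing import Mapping, Optional, Sequence


def should_keep(
    scores: Mapping[str, float],
    languages: Optional[Sequence[str]] = None,
    threshold: float = 0.65,
    label_only: bool = False,
) -> bool:
    """Illustrative keep/reject decision (hypothetical helper, not datatrove code)."""
    if label_only:
        # Label-only mode: never reject, the caller only wants the annotations.
        return True
    if languages is not None:
        # Target languages given: keep if any of them clears the threshold.
        return any(scores.get(lang, 0.0) > threshold for lang in languages)
    # No target languages: keep if the top prediction clears the threshold.
    return max(scores.values(), default=0.0) > threshold
```

With scores of {"en": 0.9, "fr": 0.05}, this sketch keeps the document for languages=["en"], rejects it for languages=["fr"], and keeps it in either case when label_only=True.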

GlotLID-specific behavior: When using the glotlid backend, the predicted label has the format lang_script (e.g., eng_Latn). The filter splits this into separate language and language_script metadata fields.
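The label split described here amounts to one string operation. The helper below is a hypothetical illustration of that step, assuming the label always contains exactly one language part and one script part separated by an underscore; it is not taken from the datatrove source.

```python
def split_glotlid_label(label: str) -> tuple[str, str]:
    """Split a 'lang_script' label into (language, script). Hypothetical helper."""
    language, script = label.split("_", 1)
    return language, script


# Populate the two metadata fields the way the description above outlines.
metadata = {}
metadata["language"], metadata["language_script"] = split_glotlid_label("eng_Latn")
print(metadata)  # {'language': 'eng', 'language_script': 'Latn'}
```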

Top-pairs tracking: When keep_top_pairs_threshold is set to a non-negative value, all language predictions with scores above that threshold are stored in metadata as top_language_{lang}_score keys.
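To make the key naming concrete, the following sketch builds the extra metadata entries from a mapping of language label to score. top_pair_metadata is a hypothetical helper, and treating any negative threshold as "disabled" is an assumption based on the "non-negative" wording above.

```python
def top_pair_metadata(
    scores: dict[str, float],
    keep_top_pairs_threshold: float = -1,
) -> dict[str, float]:
    """Build top_language_{lang}_score entries (illustrative, not datatrove code)."""
    if keep_top_pairs_threshold < 0:
        # Negative threshold: tracking is disabled, nothing is stored.
        return {}
    return {
        f"top_language_{lang}_score": score
        for lang, score in scores.items()
        if score > keep_top_pairs_threshold
    }


extra = top_pair_metadata({"en": 0.8, "fr": 0.15, "de": 0.02}, 0.1)
print(extra)  # {'top_language_en_score': 0.8, 'top_language_fr_score': 0.15}
```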

Usage

Use LanguageFilter after text extraction to restrict a datatrove pipeline to specific target languages, or in label-only mode to annotate documents with language metadata for downstream analysis.

Code Reference

Source Location

Signature

class LanguageFilter(BaseFilter):
    name = "Language ID"
    _requires_dependencies = [("fasttext", "fasttext-numpy2-wheel"), "fasteners"]

    def __init__(
        self,
        languages: list[str] | str | None = None,
        language_threshold: float = 0.65,
        exclusion_writer: DiskWriter = None,
        backend: Literal["ft176", "glotlid"] = "ft176",
        label_only: bool = False,
        keep_top_pairs_threshold: float = -1,
    ):
        ...

    def filter(self, doc: Document) -> bool:
        ...

Import

from datatrove.pipeline.filters import LanguageFilter

I/O Contract

Inputs

Name                      Type                         Required               Description
languages                 list[str] | str | None       No (default: None)     Language codes to keep (e.g., ["en", "fr"]). A single string is wrapped in a list. None accepts every language whose score exceeds the threshold.
language_threshold        float                        No (default: 0.65)     Minimum confidence score to accept a document
exclusion_writer          DiskWriter                   No (default: None)     Optional writer to save rejected documents
backend                   Literal["ft176", "glotlid"]  No (default: "ft176")  Language identification model backend
label_only                bool                         No (default: False)    If True, annotate language metadata without removing any documents
keep_top_pairs_threshold  float                        No (default: -1)       Store every language prediction scoring above this value in metadata. Set to -1 to disable.

Pipeline Input: A Document object with plain text in its .text field.
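The handling of the languages argument (a single string wrapped in a list, None left untouched) can be pictured as follows. normalize_languages is a hypothetical name used only for this sketch; the real constructor may implement the same idea differently.

```python
from typing import Optional, Sequence, Union


def normalize_languages(
    languages: Union[Sequence[str], str, None],
) -> Optional[list[str]]:
    """Return a list of language codes, or None for 'accept any language'."""
    if languages is None:
        return None
    if isinstance(languages, str):
        # A bare string such as "en" becomes a one-element list.
        return [languages]
    return list(languages)
```

Under this sketch, "en" and ["en"] describe the same filter, while None keeps the "any language above the threshold" behaviour.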

Outputs

Name            Type  Description
(return value)  bool  True if the document passes the language criteria and should be kept

Metadata annotations added to each document:

Key                        Type   Description
language                   str    Predicted language code (e.g., "en", "fr")
language_score             float  Confidence score for the predicted language
language_script            str    Script identifier (only when backend="glotlid", e.g., "Latn")
top_language_{lang}_score  float  Score for each language above keep_top_pairs_threshold (when enabled)

Usage Examples

Filter to English Only

from datatrove.pipeline.filters import LanguageFilter

lang_filter = LanguageFilter(
    languages=["en"],
    language_threshold=0.65,
)

Multi-Language Filter with GlotLID

from datatrove.pipeline.filters import LanguageFilter

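# GlotLID labels have the form lang_script (e.g., eng_Latn), so the target
# languages here are assumed to be given in that form rather than as "en".
lang_filter = LanguageFilter(
    languages=["eng_Latn", "fra_Latn", "deu_Latn", "spa_Latn"],
    language_threshold=0.5,
    backend="glotlid",
)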

Label-Only Mode for Analysis

from datatrove.pipeline.filters import LanguageFilter

# Annotate all documents with language info without removing any
lang_annotator = LanguageFilter(
    label_only=True,
    keep_top_pairs_threshold=0.1,
)
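Once documents carry language metadata, a downstream step can aggregate it. The sketch below assumes each document's metadata is available as a plain dict with the language and language_score keys listed earlier; the sample records are made up for illustration and the 0.65 cut-off simply mirrors the default language_threshold.

```python
from collections import Counter

# Hypothetical metadata records, shaped like the annotations listed above.
records = [
    {"language": "en", "language_score": 0.98},
    {"language": "fr", "language_score": 0.91},
    {"language": "en", "language_score": 0.42},
]

# How many documents were labelled with each language.
language_counts = Counter(r["language"] for r in records)

# Documents that a filtering run with the default threshold would have rejected.
low_confidence = [r for r in records if r["language_score"] < 0.65]

print(language_counts.most_common())  # [('en', 2), ('fr', 1)]
print(len(low_confidence))            # 1
```

Inspecting the distribution this way before enabling filter mode can help when choosing a threshold for a given corpus.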

Full Pipeline Example

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import LanguageFilter, URLFilter
from datatrove.pipeline.readers import WarcReader

pipeline = LocalPipelineExecutor(
    pipeline=[
        WarcReader("s3://commoncrawl/crawl-data/CC-MAIN-2024-10/"),
        Trafilatura(favour_precision=True, timeout=1),
        URLFilter(),
        LanguageFilter(languages=["en"], language_threshold=0.65),
    ],
    tasks=100,
)
pipeline.run()

Related Pages

Principle:Huggingface_Datatrove_Language_Filtering
