Implementation:Huggingface Datatrove LanguageIdentifier

From Leeroopedia
Knowledge Sources
Domains: NLP, Language Identification, Data Processing
Last Updated: 2026-02-14 17:00 GMT

Overview

Provides an abstract base class and FastText-based implementations for language identification of text documents.

Description

The lid.py module defines a hierarchy of language identification (LID) classes used to detect the language of a given Document. At the top of the hierarchy is the abstract LID base class, which declares a predict method that returns the best-matching language alongside confidence scores for all requested languages.

FastTextLID extends the base class with a concrete implementation powered by Facebook's FastText library. It lazily loads a FastText model from a cached or downloaded binary file, then predicts language labels by running the model on the document text with newlines replaced by spaces. The k parameter controls how many of the top-scoring language predictions the model returns; its default value, -1, returns predictions for all languages.
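The lazy loading and newline handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the datatrove source: StubModel, LazyModelHolder, _load_model, and predict_raw are hypothetical names, and the stub stands in for a real FastText model so the snippet runs without downloading anything. The newline replacement matters because FastText's Python predict call processes one line at a time and rejects text that contains a newline.

```python
class StubModel:
    """Hypothetical stand-in for a FastText model; records the text it receives."""

    def __init__(self):
        self.seen = []

    def predict(self, text, k=-1):
        self.seen.append(text)
        # Fixed labels in the "__label__<code>" style used by FastText models
        labels = ("__label__en", "__label__fr")
        scores = (0.9, 0.1)
        if k > 0:
            return labels[:k], scores[:k]
        return labels, scores  # k = -1: return every label


class LazyModelHolder:
    """Loads the model on first access and reuses it afterwards."""

    def __init__(self, k=-1):
        self.k = k
        self._model = None
        self.load_count = 0  # only here to make the laziness observable

    def _load_model(self):
        # A real implementation would locate or download the model file here
        self.load_count += 1
        return StubModel()

    @property
    def model(self):
        if self._model is None:
            self._model = self._load_model()
        return self._model

    def predict_raw(self, text):
        # Replace newlines with spaces before handing the text to the model
        return self.model.predict(text.replace("\n", " "), k=self.k)


holder = LazyModelHolder(k=1)          # nothing is loaded yet
labels, scores = holder.predict_raw("Hello\nworld")
holder.predict_raw("Second call")      # reuses the already loaded model
```

Deferring the load until the first prediction keeps construction cheap, which helps when the object is created in one process and used in another.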

Two concrete subclasses are provided out of the box: FT176LID, which uses the standard FastText lid.176 model trained on 176 languages, and GlotLID, which loads models from the GlotLID repository on Hugging Face Hub with configurable version strings (defaulting to v3). Both inherit the lazy model loading and prediction logic from FastTextLID.
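One way a subclass can combine class-level model settings with a configurable version string is sketched below. Everything here is an assumption for illustration: BaseLID and VersionedLID are hypothetical classes, and the URL is a placeholder template, not the address used by GlotLID.

```python
class BaseLID:
    # Subclasses are expected to override these class attributes
    MODEL_URL = None
    MODEL_SUBFOLDER = None


class VersionedLID(BaseLID):
    # Placeholder template (hypothetical); "{version}" is filled in per instance
    MODEL_URL = "https://example.invalid/models/model_{version}.bin"
    MODEL_SUBFOLDER = "versioned"

    def __init__(self, version="v3"):
        # The instance attribute shadows the class-level template,
        # so other instances can still pick a different version
        self.MODEL_URL = self.MODEL_URL.format(version=version)


default = VersionedLID()
older = VersionedLID(version="v2")
print(default.MODEL_URL)
print(older.MODEL_URL)
```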

Usage

Use these classes when building data processing pipelines that need to classify or filter documents by language. The LanguageFilter pipeline step delegates to these LID classes internally. You can also instantiate them directly for standalone language detection tasks.
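As a rough sketch of filtering by language, the snippet below keeps a document only when its best language is allowed and scored above a threshold. It does not use datatrove: Doc, KeywordLID, keep_document, and the 0.65 threshold are hypothetical, and KeywordLID is a toy classifier that only mimics the predict contract (best pair plus per-language scores) so the example runs without a model.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical minimal document with the two fields used here."""
    text: str
    id: str


class KeywordLID:
    """Toy stand-in that follows the predict contract of the LID classes."""

    def __init__(self, languages):
        self.languages = languages

    def predict(self, doc):
        # Toy rule: the word "est" marks the text as French
        if "est" in doc.text.split():
            raw = {"fr": 0.9, "en": 0.1}
        else:
            raw = {"en": 0.9, "fr": 0.1}
        best = max(raw.items(), key=lambda kv: kv[1])
        return best, {lang: raw.get(lang, 0.0) for lang in self.languages}


def keep_document(lid, doc, allowed, threshold=0.65):
    """Keep a document if its best language is allowed and confident enough."""
    (best_lang, best_score), _ = lid.predict(doc)
    return best_lang in allowed and best_score >= threshold


docs = [Doc("This is English.", "a"), Doc("Ceci est un texte.", "b")]
lid = KeywordLID(["en", "fr"])
kept = [d.id for d in docs if keep_document(lid, d, allowed={"en"})]
print(kept)
```

Swapping KeywordLID for a real LID instance should require no change to keep_document, provided the real class returns the same pair of values.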

Code Reference

Source Location

Signature

class LID:
    def __init__(self, languages: list[str] | None = None) -> None: ...
    def predict(self, doc: Document) -> tuple[tuple[str, int], dict[str, float]]: ...

class FastTextLID(LID):
    MODEL_URL = None
    MODEL_SUBFOLDER = None
    def __init__(self, languages: list[str] | None = None, k: int = -1) -> None: ...
    def predict(self, doc: Document) -> tuple[tuple[str, int], dict[str, float]]: ...

class FT176LID(FastTextLID):
    MODEL_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"
    MODEL_SUBFOLDER = "ft176"

class GlotLID(FastTextLID):
    MODEL_SUBFOLDER = "glotlid"
    def __init__(self, languages: list[str] | None = None, k: int = -1, version: str = "v3") -> None: ...

Import

from datatrove.utils.lid import LID, FastTextLID, FT176LID, GlotLID

I/O Contract

Inputs

Name | Type | Required | Description
languages | list[str] or None | No | List of language codes to report scores for; None returns all detected languages
k | int | No | Number of top-k languages to retrieve from the model; -1 means all (FastTextLID and subclasses)
version | str | No | GlotLID model version string, e.g. "v3" (GlotLID only)
doc | Document | Yes | The document whose text will be analyzed for language (passed to predict)

Outputs

Name | Type | Description
best_lang_pair | tuple[str, float] | A tuple of the best predicted language code and its confidence score
lang_scores | dict[str, float] | A dictionary mapping each requested language code to its prediction score
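The two outputs can be derived from raw FastText-style predictions roughly as shown below. This is a sketch, not the library code: to_lid_output is a hypothetical helper, and giving a score of 0.0 to a requested language that the model did not return is an assumption made for this example.

```python
LABEL_PREFIX = "__label__"  # prefix FastText puts in front of each class label


def to_lid_output(labels, scores, languages):
    """Turn parallel label/score sequences into (best_lang_pair, lang_scores)."""
    # Strip the prefix: "__label__en" -> "en", "__label__eng_Latn" -> "eng_Latn"
    lang_pairs = {
        label[len(LABEL_PREFIX):]: float(score)
        for label, score in zip(labels, scores)
    }
    # Best pair is the (language, score) entry with the highest score
    best_lang_pair = max(lang_pairs.items(), key=lambda kv: kv[1])
    # Assumption: requested languages missing from the predictions score 0.0
    lang_scores = {lang: lang_pairs.get(lang, 0.0) for lang in languages}
    return best_lang_pair, lang_scores


best, scores = to_lid_output(
    ("__label__en", "__label__fr"), (0.98, 0.01), ["en", "fr", "de"]
)
print(best)    # ('en', 0.98)
print(scores)  # {'en': 0.98, 'fr': 0.01, 'de': 0.0}
```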

Usage Examples

Basic Usage

from datatrove.data import Document
from datatrove.utils.lid import FT176LID

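# Report scores only for these three languages; the model is loaded lazily
lid = FT176LID(languages=["en", "fr", "de"])

doc = Document(text="This is a sample English document.", id="doc1")
# predict returns the best (language, score) pair and the requested scores
best_lang, all_scores = lid.predict(doc)

print(best_lang)    # e.g. ("en", 0.98); exact scores depend on the model
print(all_scores)   # e.g. {"en": 0.98, "fr": 0.01, "de": 0.005}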

Using GlotLID

from datatrove.data import Document
from datatrove.utils.lid import GlotLID

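# The language codes in this example combine a language and a script, e.g. "fra_Latn"
lid = GlotLID(languages=["eng_Latn", "fra_Latn"], version="v3")

doc = Document(text="Ceci est un document en français.", id="doc2")
best_lang, all_scores = lid.predict(doc)

print(best_lang)    # e.g. ("fra_Latn", 0.95); exact score depends on the model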
