Implementation:Huggingface Datatrove LanguageIdentifier

Knowledge Sources	Huggingface_Datatrove
Domains	NLP, Language Identification, Data Processing
Last Updated	2026-02-14 17:00 GMT

Overview

Provides an abstract base class and FastText-based implementations for language identification of text documents.

Description

The lid.py module defines a hierarchy of language identification (LID) classes used to detect the language of a given Document. At the top of the hierarchy is the abstract LID base class, which declares a predict method that returns the best-matching language alongside confidence scores for all requested languages.

FastTextLID extends the base class with a concrete implementation powered by Facebook's FastText library. It lazily loads a FastText model from a cached or downloaded binary file, then predicts language labels by running the model on the document text (with newlines replaced by spaces). The k parameter controls how many of the top-scoring predictions the model returns; its default of -1 returns all languages.
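
The prediction step described above can be sketched without FastText installed. This is a minimal illustration, not the module's actual code: Doc and StubModel are hypothetical stand-ins for Document and a loaded FastText model, the helper name predict_languages is invented for this example, and scoring requested languages outside the top-k as 0.0 is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for datatrove's Document (only `text` is used)."""
    text: str


class StubModel:
    """Hypothetical stand-in for a loaded FastText model.

    Returns fixed (labels, scores) in descending score order, truncated to
    k entries unless k == -1.
    """

    _ranked = [("__label__en", 0.9), ("__label__fr", 0.07), ("__label__de", 0.03)]

    def predict(self, text: str, k: int = -1):
        # The stub insists on single-line input, which is why the caller
        # replaces newlines first.
        assert "\n" not in text
        ranked = self._ranked if k == -1 else self._ranked[:k]
        labels = tuple(label for label, _ in ranked)
        scores = tuple(score for _, score in ranked)
        return labels, scores


def predict_languages(model, doc, languages, k=-1):
    # Replace newlines with spaces before handing the text to the model.
    labels, scores = model.predict(doc.text.replace("\n", " "), k=k)
    # Strip the "__label__" prefix to get bare language codes.
    lang_scores = {
        label.removeprefix("__label__"): float(score)
        for label, score in zip(labels, scores)
    }
    best_lang_pair = max(lang_scores.items(), key=lambda pair: pair[1])
    # Assumption: requested languages missing from the top-k score 0.0.
    requested = {lang: lang_scores.get(lang, 0.0) for lang in languages}
    return best_lang_pair, requested


best, scores = predict_languages(StubModel(), Doc("Hello\nworld"), ["en", "de", "es"], k=2)
print(best)
print(scores)
```

With k=2 the stub only returns its two highest-scoring labels, so "de" and "es" fall back to 0.0 in the returned dictionary while the best pair is still taken from the model's own ranking.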

Two concrete subclasses are provided out of the box: FT176LID, which uses the standard FastText lid.176 model trained on 176 languages, and GlotLID, which loads models from the GlotLID repository on Hugging Face Hub with configurable version strings (defaulting to v3). Both inherit the lazy model loading and prediction logic from FastTextLID.
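
The lazy-loading pattern shared by the subclasses can be illustrated as follows. This is a sketch under stated assumptions: LazyModelHolder, Ft176Like, _load, and load_calls are hypothetical names, and the loader returns a plain dictionary instead of downloading and opening a FastText binary. Only the MODEL_URL and MODEL_SUBFOLDER values are taken from the signatures listed on this page.

```python
class LazyModelHolder:
    """Sketch of lazy loading: the model is built on first access, then reused."""

    MODEL_URL = None        # subclasses override these class attributes
    MODEL_SUBFOLDER = None

    def __init__(self):
        self._model = None
        self.load_calls = 0  # illustration only: counts loader invocations

    def _load(self):
        # Hypothetical loader. A real one would fetch MODEL_URL into a cache
        # folder and open the binary with FastText.
        self.load_calls += 1
        return {"url": self.MODEL_URL, "subfolder": self.MODEL_SUBFOLDER}

    @property
    def model(self):
        # Load only once; later accesses return the cached object.
        if self._model is None:
            self._model = self._load()
        return self._model


class Ft176Like(LazyModelHolder):
    MODEL_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"
    MODEL_SUBFOLDER = "ft176"


holder = Ft176Like()
calls_before = holder.load_calls   # nothing is loaded at construction time
first = holder.model
second = holder.model
print(calls_before, holder.load_calls, first is second)
```

Deferring the load keeps construction cheap: nothing is downloaded or read from disk until the first prediction actually needs the model.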

Usage

Use these classes when building data processing pipelines that need to classify or filter documents by language. The LanguageFilter pipeline step delegates to these LID classes internally. You can also instantiate them directly for standalone language detection tasks.
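
As an illustration of filtering by language, the sketch below consumes predictions shaped like the output of predict and keeps only texts whose best language is in an allowed set. It is not the LanguageFilter implementation: filter_by_language, toy_predict, and the 0.5 threshold are assumptions made for this example, and toy_predict is a keyword-based stand-in for a real LID model.

```python
from typing import Callable, Iterable, Iterator

# Same shape as the predict output: ((best_lang, best_score), scores_by_language)
Prediction = tuple[tuple[str, float], dict[str, float]]


def filter_by_language(
    texts: Iterable[str],
    predict: Callable[[str], Prediction],
    keep: set[str],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield texts whose best language is in `keep` with score >= threshold."""
    for text in texts:
        (best_lang, best_score), _ = predict(text)
        if best_lang in keep and best_score >= threshold:
            yield text


def toy_predict(text: str) -> Prediction:
    # Hypothetical stand-in for lid.predict(doc): flags French by one keyword.
    if "bonjour" in text.lower():
        return ("fr", 0.9), {"en": 0.1, "fr": 0.9}
    return ("en", 0.8), {"en": 0.8, "fr": 0.2}


samples = ["Hello there", "Bonjour tout le monde"]
kept = list(filter_by_language(samples, toy_predict, keep={"en"}))
print(kept)
```

Passing the predictor as a callable keeps the filter independent of any particular LID class, so a real model's predict method could be wrapped and swapped in for toy_predict.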

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/utils/lid.py
Lines: 1-81

Signature

class LID:
    def __init__(self, languages: list[str] | None = None) -> None: ...
    def predict(self, doc: Document) -> tuple[tuple[str, int], dict[str, float]]: ...

class FastTextLID(LID):
    MODEL_URL = None
    MODEL_SUBFOLDER = None
    def __init__(self, languages: list[str] | None = None, k: int = -1) -> None: ...
    def predict(self, doc: Document) -> tuple[tuple[str, int], dict[str, float]]: ...

class FT176LID(FastTextLID):
    MODEL_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"
    MODEL_SUBFOLDER = "ft176"

class GlotLID(FastTextLID):
    MODEL_SUBFOLDER = "glotlid"
    def __init__(self, languages: list[str] | None = None, k: int = -1, version: str = "v3") -> None: ...

Import

from datatrove.utils.lid import LID, FastTextLID, FT176LID, GlotLID

I/O Contract

Inputs

Name	Type	Required	Description
languages	list[str] or None	No	List of language codes to report scores for; None returns all detected languages
k	int	No	Number of top-scoring predictions to retrieve from the model; -1 (the default) retrieves all languages (FastTextLID and subclasses)
version	str	No	GlotLID model version string, e.g. "v3" (GlotLID only)
doc	Document	Yes	The document whose text will be analyzed for language (passed to predict)

Outputs

Name	Type	Description
best_lang_pair	tuple[str, float]	A tuple of the best predicted language code and its confidence score (a float, although the signature above annotates the second element as int)
lang_scores	dict[str, float]	A dictionary mapping each requested language code to its prediction score

Usage Examples

Basic Usage

from datatrove.data import Document
from datatrove.utils.lid import FT176LID

lid = FT176LID(languages=["en", "fr", "de"])

doc = Document(text="This is a sample English document.", id="doc1")
best_lang, all_scores = lid.predict(doc)

print(best_lang)    # e.g. ("en", 0.98); illustrative, actual scores vary
print(all_scores)   # e.g. {"en": 0.98, "fr": 0.01, "de": 0.005}

Using GlotLID

from datatrove.data import Document
from datatrove.utils.lid import GlotLID

lid = GlotLID(languages=["eng_Latn", "fra_Latn"], version="v3")

doc = Document(text="Ceci est un document en français.", id="doc2")
best_lang, all_scores = lid.predict(doc)

print(best_lang)    # e.g. ("fra_Latn", 0.95); illustrative, actual scores vary

Related Pages

Principle:Huggingface_Datatrove_Language_Identification
