Implementation:Huggingface Datatrove LanguageIdentifier
| Knowledge Sources | |
|---|---|
| Domains | NLP, Language Identification, Data Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Provides an abstract base class and FastText-based implementations for language identification of text documents.
Description
The lid.py module defines a hierarchy of language identification (LID) classes used to detect the language of a given Document. At the top of the hierarchy is the abstract LID base class, which declares a predict method that returns the best-matching language alongside confidence scores for all requested languages.
FastTextLID extends the base class with a concrete implementation backed by Facebook's FastText library. It lazily loads a FastText model from a cached or downloaded binary file, then predicts language labels by running the model on the document text with newlines replaced by spaces. The k parameter controls how many top predictions the model returns; its default of -1 returns predictions for all languages.
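The lazy-loading and prediction flow described above can be sketched without FastText installed. This is a minimal illustration, not the code in lid.py: `Doc`, `StubModel`, and `LazyLID` are hypothetical stand-ins, the scores are hard-coded, and the rule that requested languages outside the top-k receive 0.0 is an assumption. The `__label__` prefix is the one FastText puts on its output labels.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for datatrove's Document; only the text field matters here."""
    text: str


class StubModel:
    """Hypothetical replacement for a loaded FastText model (fixed scores)."""

    def predict(self, text: str, k: int = -1):
        # FastText predicts one line at a time, so callers flatten newlines first.
        if "\n" in text:
            raise ValueError("predict processes one line at a time")
        labels = ("__label__en", "__label__fr", "__label__de")
        scores = (0.9, 0.07, 0.03)
        if k == -1:
            return labels, scores
        return labels[:k], scores[:k]


class LazyLID:
    def __init__(self, languages: list[str], k: int = -1) -> None:
        self.languages = languages
        self.k = k
        self._model = None  # nothing is loaded at construction time

    @property
    def model(self):
        # Load on first access only; later calls reuse the cached object.
        if self._model is None:
            self._model = StubModel()
        return self._model

    def predict(self, doc: Doc):
        labels, scores = self.model.predict(doc.text.replace("\n", " "), k=self.k)
        # Strip the "__label__" prefix to recover the bare language code.
        pairs = {
            label.removeprefix("__label__"): float(score)
            for label, score in zip(labels, scores)
        }
        best = max(pairs.items(), key=lambda item: item[1])
        # Assumption: requested languages missing from the top-k get 0.0.
        return best, {lang: pairs.get(lang, 0.0) for lang in self.languages}


lid = LazyLID(languages=["en", "fr", "it"], k=2)
best, scores = lid.predict(Doc(text="Hello\nworld"))
print(best, scores)
```

Deferring the load to first use keeps object construction cheap, which is useful when the identifier is created in one process and only run in another.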
Two concrete subclasses are provided out of the box: FT176LID, which uses the standard FastText lid.176 model trained on 176 languages, and GlotLID, which loads models from the GlotLID repository on Hugging Face Hub with configurable version strings (defaulting to v3). Both inherit the lazy model loading and prediction logic from FastTextLID.
Usage
Use these classes when building data processing pipelines that need to classify or filter documents by language. The LanguageFilter pipeline step delegates to these LID classes internally. You can also instantiate them directly for standalone language detection tasks.
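As a rough illustration of filtering on the values returned by predict, the sketch below applies a score threshold to the best language pair. `keep_document` is a hypothetical helper and the 0.5 threshold is arbitrary; this is not the LanguageFilter implementation, and the inputs are hand-written values in the shape of predict's output rather than real model scores.

```python
def keep_document(
    best_lang_pair: tuple[str, float],
    lang_scores: dict[str, float],
    threshold: float = 0.5,
) -> bool:
    """Keep a document if its best language was requested and scores high enough."""
    best_lang, best_score = best_lang_pair
    # lang_scores is keyed by the requested languages, so membership
    # tells us whether the top prediction is one the caller asked for.
    return best_lang in lang_scores and best_score >= threshold


# Hand-written values in the shape returned by predict(); not real model output.
print(keep_document(("en", 0.98), {"en": 0.98, "fr": 0.01}))  # requested, confident
print(keep_document(("es", 0.91), {"en": 0.03, "fr": 0.02}))  # best language not requested
print(keep_document(("en", 0.40), {"en": 0.40, "fr": 0.35}))  # below the threshold
```

Checking both conditions avoids keeping a document whose top prediction is confident but outside the requested set.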
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/utils/lid.py
- Lines: 1-81
Signature
class LID:
    def __init__(self, languages: list[str] | None = None) -> None: ...
    def predict(self, doc: Document) -> tuple[tuple[str, int], dict[str, float]]: ...

class FastTextLID(LID):
    MODEL_URL = None
    MODEL_SUBFOLDER = None
    def __init__(self, languages: list[str] | None = None, k: int = -1) -> None: ...
    def predict(self, doc: Document) -> tuple[tuple[str, int], dict[str, float]]: ...

class FT176LID(FastTextLID):
    MODEL_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"
    MODEL_SUBFOLDER = "ft176"

class GlotLID(FastTextLID):
    MODEL_SUBFOLDER = "glotlid"
    def __init__(self, languages: list[str] | None = None, k: int = -1, version: str = "v3") -> None: ...
Import
from datatrove.utils.lid import LID, FastTextLID, FT176LID, GlotLID
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| languages | list[str] or None | No | List of language codes to report scores for; None returns all detected languages |
| k | int | No | Number of top-k languages to retrieve from the model; -1 means all (FastTextLID and subclasses) |
| version | str | No | GlotLID model version string, e.g. "v3" (GlotLID only) |
| doc | Document | Yes | The document whose text will be analyzed for language (passed to predict) |
Outputs
| Name | Type | Description |
|---|---|---|
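| best_lang_pair | tuple[str, float] | A tuple of the best predicted language code and its confidence score (the signature annotates this element as tuple[str, int], although the score is a floating-point confidence) |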
| lang_scores | dict[str, float] | A dictionary mapping each requested language code to its prediction score |
Usage Examples
Basic Usage
from datatrove.data import Document
from datatrove.utils.lid import FT176LID

# Report scores only for English, French, and German
lid = FT176LID(languages=["en", "fr", "de"])
doc = Document(text="This is a sample English document.", id="doc1")

best_lang, all_scores = lid.predict(doc)
print(best_lang)   # e.g. ("en", 0.98); exact scores depend on the model
print(all_scores)  # e.g. {"en": 0.98, "fr": 0.01, "de": 0.005}
Using GlotLID
from datatrove.data import Document
from datatrove.utils.lid import GlotLID

# GlotLID labels combine a language code and a script, e.g. "fra_Latn"
lid = GlotLID(languages=["eng_Latn", "fra_Latn"], version="v3")
doc = Document(text="Ceci est un document en français.", id="doc2")

best_lang, all_scores = lid.predict(doc)
print(best_lang)  # e.g. ("fra_Latn", 0.95); exact scores depend on the model