Implementation:Huggingface Datatrove LangStats

Knowledge Sources	Huggingface_Datatrove
Domains	Data Quality, Natural Language Processing
Last Updated	2026-02-14 17:00 GMT

Overview

LangStats is a statistics pipeline step that computes language identification confidence scores for documents using a FastText-based language classifier.

Description

LangStats extends BaseStats to measure how confidently a document is identified as belonging to a specific target language. It uses FT176LID, a wrapper around the FastText language identification model that recognizes 176 languages (according to the fastText documentation, the model was trained on data from Wikipedia, Tatoeba, and SETimes).

The class employs a two-stage lookup strategy for efficiency. It first checks whether the document's metadata already contains language and language_score fields (which may have been set by a prior language identification step in the pipeline). If the metadata indicates the document matches the target language, the existing score is reused. Otherwise, the FastText model is invoked to compute a fresh prediction. This caching behavior avoids redundant computation when language identification has already been performed upstream.
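The lookup described above can be sketched in a few lines. This is a minimal illustration and not the library source: `Doc` and `fake_predict` are hypothetical stand-ins for the pipeline document and the FastText classifier, and the scores are made up.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Doc:
    """Hypothetical stand-in for a pipeline document."""
    text: str
    metadata: dict = field(default_factory=dict)


def language_score(doc: Doc, language: str, predict: Callable[[Doc], dict]) -> float:
    """Reuse an upstream score if it is for the target language, else call the classifier."""
    meta = doc.metadata
    if meta.get("language") == language and "language_score" in meta:
        return meta["language_score"]
    # No usable upstream score: fall back to a fresh prediction.
    return predict(doc).get(language, 0.0)


def fake_predict(doc: Doc) -> dict:
    # Hypothetical classifier returning fixed, made-up scores per language.
    return {"en": 0.25, "fr": 0.75}


target = "en"
cached = Doc("hello", {"language": "en", "language_score": 0.98})
other = Doc("bonjour", {"language": "fr", "language_score": 0.91})

reused = language_score(cached, target, fake_predict)  # metadata matches: score reused
fresh = language_score(other, target, fake_predict)    # metadata is for "fr": classifier called
stats = {f"fasttext_{target}": fresh}
```

Note that in this sketch a document tagged upstream as French is still scored for English by the classifier, so the statistic always describes the target language rather than the document's top-ranked language.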

The output statistic is named fasttext_{language} where language is the target language code (e.g., "en" for English), and the value is the model's confidence score for that language.

Usage

Use LangStats when you need to measure how confidently the documents in a dataset are identified as a given target language. Because each instance tracks a single language, it is particularly useful for verifying that language filtering has been applied correctly; to monitor a multilingual corpus, add one LangStats step per language of interest.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/stats/lang_stats.py
Lines: 1-38

Signature

class LangStats(BaseStats):
    name = "🎤 Language stats"

    def __init__(
        self,
        output_folder: DataFolderLike,
        language: str,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None: ...

    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...

Import

from datatrove.pipeline.stats.lang_stats import LangStats

I/O Contract

Inputs

Name	Type	Required	Description
output_folder	DataFolderLike	Yes	Folder where computed statistics will be saved
language	str	Yes	Target language code to measure confidence for (e.g., "en", "fr", "de")
groups_to_compute	list[GROUP]	No	Grouping strategies for statistics (default: all groups)
histogram_round_digits	int	No	Decimal digits for histogram rounding (default: 3)
top_k_config	TopKConfig	No	Top-K configuration for high-cardinality groups

Outputs

Name	Type	Description
fasttext_{language}	float	FastText confidence score for the target language (0.0 to 1.0)
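To give a feel for what histogram_round_digits does to these scores, the sketch below buckets a few values by rounding. It assumes the histogram key is simply the score rounded to the configured number of digits; the actual aggregation inside BaseStats may differ, and the sample scores are made up.

```python
from collections import Counter


def histogram(scores: list, round_digits: int = 3) -> Counter:
    """Count scores per bucket, where a bucket is the score rounded to `round_digits`."""
    return Counter(round(score, round_digits) for score in scores)


# Made-up confidence scores, for illustration only.
sample_scores = [0.98712, 0.98749, 0.5, 0.12345]
buckets = histogram(sample_scores, round_digits=3)
```

Under this assumption, fewer digits give coarser buckets and a smaller histogram, while more digits preserve detail at the cost of more distinct keys.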

Usage Examples

Basic Usage

from datatrove.pipeline.stats.lang_stats import LangStats

# Compute English language confidence statistics
stats = LangStats(
    output_folder="output/stats/",
    language="en",
)

Multiple Languages

from datatrove.pipeline.stats.lang_stats import LangStats

# Create separate stats steps for different languages
en_stats = LangStats(output_folder="output/stats/en/", language="en")
fr_stats = LangStats(output_folder="output/stats/fr/", language="fr")
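The examples above only construct the steps. A hedged sketch of how they might be wired into a runnable pipeline follows. It assumes a JSONL input folder at the hypothetical path `data/`, uses `JsonlReader` and `LocalPipelineExecutor` from datatrove, and assumes each stats step passes documents through to the next step. The function name and paths are illustrative.

```python
def build_executor(input_folder: str = "data/", stats_root: str = "output/stats"):
    """Build a local executor that runs one LangStats step per language (sketch)."""
    # Imported inside the function so the sketch can be loaded without datatrove installed.
    from datatrove.executor import LocalPipelineExecutor
    from datatrove.pipeline.readers import JsonlReader
    from datatrove.pipeline.stats.lang_stats import LangStats

    # Hypothetical input location; replace with the real dataset folder.
    pipeline = [JsonlReader(input_folder)]
    for lang in ("en", "fr"):
        # One stats step per target language, each writing to its own folder.
        pipeline.append(LangStats(output_folder=f"{stats_root}/{lang}/", language=lang))
    return LocalPipelineExecutor(pipeline=pipeline, tasks=1)
```

Calling `build_executor().run()` would then execute the pipeline locally. Keeping a separate output folder per language avoids mixing the `fasttext_en` and `fasttext_fr` statistics.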

Related Pages

Principle:Huggingface_Datatrove_Language_Statistics
