Implementation:Huggingface Datatrove LangStats

From Leeroopedia
Knowledge Sources
Domains Data Quality, Natural Language Processing
Last Updated 2026-02-14 17:00 GMT

Overview

LangStats is a statistics pipeline step that computes language identification confidence scores for documents using a FastText-based language classifier.

Description

LangStats extends BaseStats to measure how confidently a document is identified as belonging to a specific target language. It uses FT176LID, a wrapper around the FastText lid.176 language identification model, which recognizes 176 languages. According to the FastText documentation, that model was trained on data from Wikipedia, Tatoeba, and SETimes.

The class employs a two-stage lookup strategy for efficiency. It first checks whether the document's metadata already contains language and language_score fields (which may have been set by a prior language identification step in the pipeline). If the metadata indicates the document matches the target language, the existing score is reused. Otherwise, the FastText model is invoked to compute a fresh prediction. This caching behavior avoids redundant computation when language identification has already been performed upstream.

The output statistic is named fasttext_{language} where language is the target language code (e.g., "en" for English), and the value is the model's confidence score for that language.
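The lookup described above can be sketched in a few lines. This is a minimal, self-contained illustration and not the library's code: the function `extract_lang_stat`, the plain-dict metadata, and the `fake_classifier` stand-in for the FastText model are all hypothetical names introduced here, and the fallback to 0.0 for a missing language is an assumption.

```python
from typing import Callable


def extract_lang_stat(
    metadata: dict,
    text: str,
    language: str,
    classify: Callable[[str], dict[str, float]],
) -> dict[str, float]:
    """Return {"fasttext_<language>": score}, reusing an upstream score when possible."""
    # Stage 1: reuse the score if a prior step already tagged the target language.
    if metadata.get("language") == language and "language_score" in metadata:
        score = metadata["language_score"]
    else:
        # Stage 2: otherwise ask the classifier for a fresh prediction.
        # Assumption: a language missing from the prediction counts as 0.0.
        score = classify(text).get(language, 0.0)
    return {f"fasttext_{language}": score}


def fake_classifier(text: str) -> dict[str, float]:
    # Hypothetical stand-in for the FastText model; returns fixed scores.
    return {"en": 0.9, "fr": 0.1}


# Metadata matches the target language, so the upstream score is reused.
cached = extract_lang_stat(
    {"language": "en", "language_score": 0.97}, "Hello world", "en", fake_classifier
)

# Metadata names a different language, so the classifier is invoked.
fresh = extract_lang_stat(
    {"language": "fr", "language_score": 0.88}, "Hello world", "en", fake_classifier
)
```

In this sketch `cached` carries the upstream value 0.97, while `fresh` carries the stub classifier's 0.9, because the stored score belongs to a different language and cannot be reused.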

Usage

Use LangStats when you need to measure how confidently the documents in a dataset are identified as a given target language. Because each instance tracks a single language, profiling several languages requires one LangStats step per language. It is particularly useful for monitoring multilingual corpora or verifying that language filtering has been applied correctly.

Code Reference

Source Location

Signature

class LangStats(BaseStats):
    name = "🎤 Language stats"

    def __init__(
        self,
        output_folder: DataFolderLike,
        language: str,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:
        ...

    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...

Import

from datatrove.pipeline.stats.lang_stats import LangStats

I/O Contract

Inputs

Name                    Type            Required  Description
output_folder           DataFolderLike  Yes       Folder where computed statistics will be saved
language                str             Yes       Target language code to measure confidence for (e.g., "en", "fr", "de")
groups_to_compute       list[GROUP]     No        Grouping strategies for statistics (default: all groups)
histogram_round_digits  int             No        Decimal digits for histogram rounding (default: 3)
top_k_config            TopKConfig      No        Top-K configuration for high-cardinality groups (default: DEFAULT_TOP_K_CONFIG)
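To illustrate what histogram_round_digits controls, the sketch below assumes that each score is rounded to the configured number of decimal digits before being counted into a histogram bucket. The helper `build_histogram` and the sample scores are hypothetical and exist only for this illustration; the actual bucketing in BaseStats may differ in detail.

```python
from collections import Counter


def build_histogram(scores: list[float], round_digits: int = 3) -> Counter:
    """Count scores after rounding, so nearby values share one bucket."""
    # Assumption: rounding happens before counting, as the parameter name suggests.
    return Counter(round(score, round_digits) for score in scores)


# Hypothetical confidence scores for four documents.
scores = [0.98712, 0.98749, 0.5004, 0.4996]

hist = build_histogram(scores)              # default of 3 digits
coarse = build_histogram(scores, round_digits=1)
```

Fewer digits merge more scores into the same bucket, which keeps the histogram small at the cost of resolution.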

Outputs

Name                 Type   Description
fasttext_{language}  float  FastText confidence score for the target language (0.0 to 1.0)

Usage Examples

Basic Usage

from datatrove.pipeline.stats.lang_stats import LangStats

# Compute English language confidence statistics
stats = LangStats(
    output_folder="output/stats/",
    language="en",
)

Multiple Languages

from datatrove.pipeline.stats.lang_stats import LangStats

# Create separate stats steps for different languages
en_stats = LangStats(output_folder="output/stats/en/", language="en")
fr_stats = LangStats(output_folder="output/stats/fr/", language="fr")
