Implementation:Huggingface Datatrove LangStats
| Knowledge Sources | |
|---|---|
| Domains | Data Quality, Natural Language Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
LangStats is a statistics pipeline step that computes language identification confidence scores for documents using a FastText-based language classifier.
Description
LangStats extends BaseStats to measure how confidently a document is identified as belonging to a specific target language. It uses FT176LID, a wrapper around the fastText lid.176 language identification model, which recognizes 176 languages and was trained on data from Wikipedia, Tatoeba, and SETimes.
The class employs a two-stage lookup strategy for efficiency. It first checks whether the document's metadata already contains language and language_score fields (which may have been set by a prior language identification step in the pipeline). If the metadata indicates the document matches the target language, the existing score is reused. Otherwise, the FastText model is invoked to compute a fresh prediction. This caching behavior avoids redundant computation when language identification has already been performed upstream.
The output statistic is named fasttext_{language} where language is the target language code (e.g., "en" for English), and the value is the model's confidence score for that language.
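The lookup described above can be sketched without datatrove installed. This is a minimal illustration, not the library's code: `language_stat`, `fake_predict`, and the `Predictor` alias are hypothetical names, the predictor is a stub that returns fixed scores, and the `0.0` fallback for a missing language is an assumption of this sketch.

```python
from typing import Callable

# Hypothetical stand-in for the FastText classifier: maps text to {language: score}.
Predictor = Callable[[str], dict[str, float]]


def language_stat(text: str, metadata: dict, language: str, predict: Predictor) -> dict[str, float]:
    """Reuse an upstream score when it matches the target language, else predict."""
    if metadata.get("language") == language and "language_score" in metadata:
        # Stage 1: an earlier pipeline step already scored this language.
        score = metadata["language_score"]
    else:
        # Stage 2: fall back to the classifier (0.0 default is an assumption here).
        score = predict(text).get(language, 0.0)
    return {f"fasttext_{language}": score}


def fake_predict(text: str) -> dict[str, float]:
    # Fixed scores for illustration only; a real classifier would inspect the text.
    return {"en": 0.91, "fr": 0.04}


# Metadata already carries an English score, so the predictor is skipped.
cached = language_stat("Hello world", {"language": "en", "language_score": 0.99}, "en", fake_predict)

# Metadata says French, so the English score comes from the predictor.
fresh = language_stat("Hello world", {"language": "fr", "language_score": 0.80}, "en", fake_predict)
```

Note that the cached path only applies when the stored language equals the target; a document tagged as another language is always rescored for the target language.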
Usage
Use LangStats when you need to profile the language distribution of a dataset or measure language identification confidence scores across documents. It is particularly useful for monitoring multilingual corpora or verifying that language filtering has been applied correctly.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/stats/lang_stats.py
- Lines: 1-38
Signature
class LangStats(BaseStats):
    name = "🎤 Language stats"

    def __init__(
        self,
        output_folder: DataFolderLike,
        language: str,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None: ...

    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...
Import
from datatrove.pipeline.stats.lang_stats import LangStats
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes | Folder where computed statistics will be saved |
| language | str | Yes | Target language code to measure confidence for (e.g., "en", "fr", "de") |
| groups_to_compute | list[GROUP] | No | Grouping strategies for statistics (default: all groups) |
| histogram_round_digits | int | No | Decimal digits for histogram rounding (default: 3) |
| top_k_config | TopKConfig | No | Top-K configuration for high-cardinality groups (default: DEFAULT_TOP_K_CONFIG) |
Outputs
| Name | Type | Description |
|---|---|---|
| fasttext_{language} | float | FastText confidence score for the target language (0.0 to 1.0) |
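
To make the role of histogram_round_digits concrete, here is a rough sketch of how confidence scores might be bucketed for a histogram. It assumes that each bucket key is simply the score rounded to the configured number of digits; the actual aggregation inside BaseStats may differ, and `bucket_scores` is a hypothetical helper, not part of datatrove.

```python
from collections import Counter


def bucket_scores(scores: list[float], round_digits: int = 3) -> Counter:
    """Count scores per bucket, where a bucket is the score rounded to round_digits."""
    return Counter(round(score, round_digits) for score in scores)


# Three example fasttext_en values from three documents.
scores = [0.98761, 0.98764, 0.5]

# With the default of 3 digits, the first two scores share the 0.988 bucket.
histogram = bucket_scores(scores)

# Fewer digits give coarser buckets: both high scores round to 1.0.
coarse = bucket_scores(scores, round_digits=1)
```

Under this assumption, lowering the digit count trades resolution for a smaller histogram, which matters when scores are spread over many distinct values.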
Usage Examples
Basic Usage
from datatrove.pipeline.stats.lang_stats import LangStats
# Compute English language confidence statistics
stats = LangStats(
    output_folder="output/stats/",
    language="en",
)
Multiple Languages
from datatrove.pipeline.stats.lang_stats import LangStats
# Create separate stats steps for different languages
en_stats = LangStats(output_folder="output/stats/en/", language="en")
fr_stats = LangStats(output_folder="output/stats/fr/", language="fr")