Implementation:Huggingface Datatrove PerplexityStats

Knowledge Sources	Huggingface_Datatrove
Domains	Data Quality, Natural Language Processing
Last Updated	2026-02-14 17:00 GMT

Overview

CCNetPerplexityStats is a statistics pipeline step that computes perplexity scores for documents using KenLM language models, following the CCNet methodology.

Description

CCNetPerplexityStats extends BaseStats to compute document-level perplexity using KenLM n-gram language models. Perplexity is a standard metric for measuring how well a language model predicts a text: lower perplexity indicates text that is more "expected" by the model (typically well-formed prose in the model's training language), while higher perplexity indicates unusual, noisy, or out-of-domain text.
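As a rough illustration of the metric itself, the sketch below converts a total log10 probability into a perplexity value. It assumes base-10 log scores (the convention KenLM uses) and a token count supplied by the caller. The helper name perplexity_from_log10 is hypothetical and is not part of Datatrove or KenLM.

```python
import math


def perplexity_from_log10(total_log10_prob: float, token_count: int) -> float:
    """Perplexity of a text given its total log10 probability.

    Hypothetical helper for illustration; not part of Datatrove or KenLM.
    """
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    # Average negative log10 probability per token, mapped back out of log space.
    return 10.0 ** (-total_log10_prob / token_count)


# A model that assigns every token probability 1/4 has perplexity of about 4,
# regardless of how long the text is.
uniform_score = 3 * math.log10(0.25)
print(perplexity_from_log10(uniform_score, 3))
```

A less probable text (a more negative total log score for the same token count) yields a higher perplexity, which matches the interpretation given above.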

The class instantiates a KenlmModel with a specified model_dataset (identifying which pre-trained KenLM model to use) and a language parameter. The CCNet approach, originally developed by Facebook Research, uses perplexity from language models trained on Wikipedia text to classify web documents into quality tiers. Documents with low perplexity are considered high quality (Wikipedia-like), while those with high perplexity are considered lower quality.

The output statistic is named ccnet_perplexity_{model_dataset}_{language} and contains the raw perplexity score for each document. The class requires the kenlm Python package in addition to the base tldextract dependency.

Usage

Use CCNetPerplexityStats to record a perplexity-based quality score for each document. It is particularly useful for quality-tiered filtering of web-crawled data, where perplexity serves as a proxy for how closely a text resembles the reference corpus the model was trained on.
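One way quality-tiered filtering could look once perplexity scores are available is sketched below. The tier names (head, middle, tail) follow CCNet-style terminology. The split into thirds, the helper names tier_cutoffs and assign_tier, and the sample scores are all assumptions made for illustration, not something Datatrove provides.

```python
import statistics


def tier_cutoffs(perplexities: list[float]) -> tuple[float, float]:
    """Return the two cut points that split the scores into thirds."""
    head_cutoff, tail_cutoff = statistics.quantiles(perplexities, n=3)
    return head_cutoff, tail_cutoff


def assign_tier(perplexity: float, head_cutoff: float, tail_cutoff: float) -> str:
    """Map a perplexity score to a tier; lower perplexity means a better tier."""
    if perplexity < head_cutoff:
        return "head"
    if perplexity < tail_cutoff:
        return "middle"
    return "tail"


# Made-up scores, for illustration only.
sample = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
head_cutoff, tail_cutoff = tier_cutoffs(sample)
tiers = [assign_tier(p, head_cutoff, tail_cutoff) for p in sample]
print(tiers)
```

In practice the cutoffs would be estimated per language from a representative sample, since perplexity values from different models are not directly comparable.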

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/stats/perplexity_stats.py
Lines: 1-37

Signature

class CCNetPerplexityStats(BaseStats):
    name = "🤯 CCNet perplexity stats"
    _requires_dependencies = BaseStats._requires_dependencies + ["kenlm"]

    def __init__(
        self,
        output_folder: DataFolderLike,
        model_dataset: str,
        language: str = Languages.english,
        histogram_round_digits: int = 3,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None: ...

    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...

Import

from datatrove.pipeline.stats.perplexity_stats import CCNetPerplexityStats

I/O Contract

Inputs

Name	Type	Required	Description
output_folder	DataFolderLike	Yes	Folder where computed statistics will be saved
model_dataset	str	Yes	Identifier for the pre-trained KenLM model dataset to use
language	str	No	Language code for the KenLM model (default: Languages.english)
histogram_round_digits	int	No	Decimal digits for histogram rounding (default: 3)
groups_to_compute	list[GROUP]	No	Grouping strategies for statistics (default: all groups)
top_k_config	TopKConfig	No	Top-K configuration for high-cardinality groups (default: DEFAULT_TOP_K_CONFIG)
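
To make the effect of histogram_round_digits concrete, the sketch below bins a few made-up perplexity values after rounding. It is a standalone approximation. The helper histogram_bins is hypothetical, and the exact way Datatrove applies the rounding internally is assumed rather than taken from the source.

```python
from collections import Counter


def histogram_bins(values: list[float], round_digits: int = 3) -> Counter:
    """Count values after rounding each one to `round_digits` decimal places."""
    return Counter(round(v, round_digits) for v in values)


# Made-up perplexity values, for illustration only.
scores = [341.2041, 341.2043, 341.9, 87.5]
fine = histogram_bins(scores, round_digits=3)    # near-identical values share a bin
coarse = histogram_bins(scores, round_digits=0)  # fewer digits give wider bins
print(fine)
print(coarse)
```

Fewer digits produce a smaller, coarser histogram, which may be preferable for a statistic such as perplexity that spans a wide numeric range.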

Outputs

Name	Type	Description
ccnet_perplexity_{model_dataset}_{language}	float	KenLM perplexity score for the document; lower values indicate text more similar to the model's training data
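
The shape of the per-document output can be sketched as a one-entry dictionary keyed by the naming pattern above. The function perplexity_stat and the stub fake_scorer are hypothetical stand-ins. A real run would obtain the value from a KenLM model rather than a constant.

```python
from typing import Callable


def perplexity_stat(
    text: str,
    model_dataset: str,
    language: str,
    score_fn: Callable[[str], float],
) -> dict[str, float]:
    """Build a one-entry stats dict keyed by the documented naming pattern."""
    key = f"ccnet_perplexity_{model_dataset}_{language}"
    return {key: score_fn(text)}


# Stub scorer standing in for a real KenLM model; the value is made up.
def fake_scorer(text: str) -> float:
    return 123.4


result = perplexity_stat("Some document text.", "wikipedia", "en", fake_scorer)
print(result)
```

Because the key embeds both model_dataset and language, statistics computed with different models or languages do not collide when written to the same output folder.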

Usage Examples

Basic Usage

from datatrove.pipeline.stats.perplexity_stats import CCNetPerplexityStats

# Compute perplexity using a Wikipedia-trained KenLM model
stats = CCNetPerplexityStats(
    output_folder="output/stats/",
    model_dataset="wikipedia",
    language="en",
)

Other Languages

from datatrove.pipeline.stats.perplexity_stats import CCNetPerplexityStats

# Compute perplexity for French text
stats = CCNetPerplexityStats(
    output_folder="output/stats/fr/",
    model_dataset="wikipedia",
    language="fr",
    groups_to_compute=["summary", "histogram"],
)

Related Pages

Principle:Huggingface_Datatrove_Perplexity_Statistics
