Implementation:Huggingface Datatrove PerplexityStats

From Leeroopedia
Knowledge Sources
Domains: Data Quality, Natural Language Processing
Last Updated: 2026-02-14 17:00 GMT

Overview

CCNetPerplexityStats is a statistics pipeline step that computes perplexity scores for documents using KenLM language models, following the CCNet methodology.

Description

CCNetPerplexityStats extends BaseStats to compute document-level perplexity using KenLM n-gram language models. Perplexity is a standard metric for measuring how well a language model predicts a text: lower perplexity indicates text that is more "expected" by the model (typically well-formed prose in the model's training language), while higher perplexity indicates unusual, noisy, or out-of-domain text.
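As a rough illustration of the metric, the sketch below computes perplexity from a summed log10 probability and a token count. The helper names (perplexity_from_log10, toy_unigram_log10) and the toy probability table are assumptions made for this example and are not part of datatrove or KenLM; a real n-gram model conditions each token on its preceding context rather than scoring tokens independently.

```python
import math


def perplexity_from_log10(total_log10_prob: float, n_tokens: int) -> float:
    """Perplexity = 10 ** (-(sum of log10 probabilities) / token count)."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return 10.0 ** (-total_log10_prob / n_tokens)


def toy_unigram_log10(tokens: list[str], probs: dict[str, float], unk: float = 1e-6) -> float:
    """Sum log10 probabilities under a hypothetical unigram model (illustration only)."""
    return sum(math.log10(probs.get(tok, unk)) for tok in tokens)


# Hypothetical probabilities, chosen only to make the contrast visible.
TOY_PROBS = {"the": 0.1, "cat": 0.01, "sat": 0.01}

fluent = ["the", "cat", "sat"]
noisy = ["zxq", "qqq", "!!!"]

pp_fluent = perplexity_from_log10(toy_unigram_log10(fluent, TOY_PROBS), len(fluent))
pp_noisy = perplexity_from_log10(toy_unigram_log10(noisy, TOY_PROBS), len(noisy))

# Tokens the model considers likely yield a lower perplexity than unknown tokens.
print(pp_fluent < pp_noisy)
```

The same relationship drives the statistic described here: text the model predicts well receives a low score, and text it predicts poorly receives a high one.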

The class instantiates a KenlmModel with a specified model_dataset (identifying which pre-trained KenLM model to use) and a language parameter. The CCNet approach, originally developed by Facebook Research, uses perplexity from language models trained on Wikipedia text to classify web documents into quality tiers. Documents with low perplexity are considered high quality (Wikipedia-like), while those with high perplexity are considered lower quality.
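The tiering idea can be sketched as a simple threshold function. This is a minimal illustration, not code from datatrove or CCNet: assign_tier, the tier labels, and the cutoff values are all assumptions for the example. In practice the cutoffs would be derived from the perplexity distribution of the corpus being filtered.

```python
def assign_tier(perplexity: float, head_cutoff: float, tail_cutoff: float) -> str:
    """Map a perplexity score to a quality tier using two hypothetical cutoffs."""
    if head_cutoff >= tail_cutoff:
        raise ValueError("head_cutoff must be smaller than tail_cutoff")
    if perplexity < head_cutoff:
        return "head"    # lowest perplexity: closest to the reference corpus
    if perplexity < tail_cutoff:
        return "middle"
    return "tail"        # highest perplexity: least similar to the reference corpus


# Made-up scores for three documents; the cutoffs are illustrative only.
scores = {"doc-a": 180.5, "doc-b": 420.0, "doc-c": 1350.2}
tiers = {
    doc_id: assign_tier(pp, head_cutoff=300.0, tail_cutoff=900.0)
    for doc_id, pp in scores.items()
}
print(tiers)
```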

The output statistic is named ccnet_perplexity_{model_dataset}_{language} and holds the raw perplexity score for each document. The class requires the kenlm Python package in addition to the tldextract dependency inherited from BaseStats.
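When reading the computed statistics downstream, the key can be rebuilt from the same two constructor arguments. The helper below is a hypothetical convenience that mirrors the naming pattern stated above; it is not a datatrove function, and the score value is made up.

```python
def perplexity_stat_name(model_dataset: str, language: str) -> str:
    """Rebuild the statistic key from the model dataset and language."""
    return f"ccnet_perplexity_{model_dataset}_{language}"


# Hypothetical per-document stats dictionary, shaped like the output described above.
doc_stats = {perplexity_stat_name("wikipedia", "en"): 312.4}

key = perplexity_stat_name("wikipedia", "en")
print(key, doc_stats[key])
```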

Usage

Use CCNetPerplexityStats to compute perplexity-based quality scores for documents following the CCNet methodology. It is particularly useful for quality-tiered filtering of web-crawled data, where perplexity serves as a proxy for text quality relative to a reference corpus.

Code Reference

Signature

class CCNetPerplexityStats(BaseStats):
    name = "🤯 CCNet perplexity stats"
    _requires_dependencies = BaseStats._requires_dependencies + ["kenlm"]

    def __init__(
        self,
        output_folder: DataFolderLike,
        model_dataset: str,
        language: str = Languages.english,
        histogram_round_digits: int = 3,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:
        ...

    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...

Import

from datatrove.pipeline.stats.perplexity_stats import CCNetPerplexityStats

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| output_folder | DataFolderLike | Yes | Folder where computed statistics will be saved |
| model_dataset | str | Yes | Identifier for the pre-trained KenLM model dataset to use |
| language | str | No | Language code for the KenLM model (default: Languages.english) |
| histogram_round_digits | int | No | Decimal digits for histogram rounding (default: 3) |
| groups_to_compute | list[GROUP] | No | Grouping strategies for statistics (default: all groups) |
| top_k_config | TopKConfig | No | Top-K configuration for high-cardinality groups |

Outputs

| Name | Type | Description |
|------|------|-------------|
| ccnet_perplexity_{model_dataset}_{language} | float | KenLM perplexity score for the document; lower values indicate text more similar to the model's training data |

Usage Examples

Basic Usage

from datatrove.pipeline.stats.perplexity_stats import CCNetPerplexityStats

# Compute perplexity using a Wikipedia-trained KenLM model
stats = CCNetPerplexityStats(
    output_folder="output/stats/",
    model_dataset="wikipedia",
    language="en",
)

Multiple Languages

from datatrove.pipeline.stats.perplexity_stats import CCNetPerplexityStats

# Compute perplexity for French text
stats = CCNetPerplexityStats(
    output_folder="output/stats/fr/",
    model_dataset="wikipedia",
    language="fr",
    groups_to_compute=["summary", "histogram"],
)
