Implementation:Huggingface Datatrove PerplexityStats
| Knowledge Sources | |
|---|---|
| Domains | Data Quality, Natural Language Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
CCNetPerplexityStats is a statistics pipeline step that computes perplexity scores for documents using KenLM language models, following the CCNet methodology.
Description
CCNetPerplexityStats extends BaseStats to compute document-level perplexity using KenLM n-gram language models. Perplexity is a standard metric for measuring how well a language model predicts a text: lower perplexity indicates text that is more "expected" by the model (typically well-formed prose in the model's training language), while higher perplexity indicates unusual, noisy, or out-of-domain text.
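As a rough illustration of the metric itself, the sketch below turns per-line log10 probabilities (the unit KenLM reports) into a document-level perplexity. The function name `perplexity_from_log10` and the toy scores are hypothetical, and counting one extra end-of-sentence token per line is an assumption about how the score is normalized, not something confirmed by this page.

```python
def perplexity_from_log10(line_scores: list[float], line_token_counts: list[int]) -> float:
    """Document perplexity from per-line log10 probabilities.

    Assumption: each line contributes its token count plus one
    end-of-sentence token to the normalizing length.
    """
    total_log10 = sum(line_scores)
    total_tokens = sum(n + 1 for n in line_token_counts)
    # Perplexity is the inverse geometric mean of the per-token probabilities.
    return 10.0 ** (-total_log10 / total_tokens)


# Toy numbers, not produced by a real model: same length, different likelihood.
fluent = perplexity_from_log10([-6.0], [3])   # 10 ** (6 / 4)
noisy = perplexity_from_log10([-12.0], [3])   # 10 ** (12 / 4)
print(fluent < noisy)
```

The less probable text receives the higher perplexity, which is the ordering the description above relies on.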
The class instantiates a KenlmModel with a specified model_dataset (identifying which pre-trained KenLM model to use) and a language parameter. The CCNet approach, originally developed by Facebook Research, uses perplexity from language models trained on Wikipedia text to classify web documents into quality tiers. Documents with low perplexity are considered high quality (Wikipedia-like), while those with high perplexity are considered lower quality.
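The tiering idea can be sketched as follows. This is not part of CCNetPerplexityStats, which only records the score; the helper `assign_tier`, the sample scores, and the choice of tercile cutoffs are all illustrative assumptions. The tier names head, middle, and tail follow CCNet's terminology.

```python
import statistics


def assign_tier(perplexity: float, head_cutoff: float, tail_cutoff: float) -> str:
    """Map a perplexity score to a quality tier (lower perplexity = better)."""
    if perplexity < head_cutoff:
        return "head"
    if perplexity < tail_cutoff:
        return "middle"
    return "tail"


# Hypothetical perplexity scores for a sample of documents in one language.
sample_scores = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]

# Two cut points that split the sample into three roughly equal parts.
head_cutoff, tail_cutoff = statistics.quantiles(sample_scores, n=3)

tiers = [assign_tier(s, head_cutoff, tail_cutoff) for s in sample_scores]
print(tiers)
```

In practice the cutoffs would be estimated per language from a large sample, since perplexity scales differ between models.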
The output statistic is named ccnet_perplexity_{model_dataset}_{language} and contains the raw perplexity score for each document. The class requires the kenlm Python package in addition to the base tldextract dependency.
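To make the naming scheme concrete, the sketch below builds the statistic key and the per-document dictionary shape. The helpers `stat_name` and `fake_extract_stats` and the fixed score are stand-ins for illustration only; they are not the library's code.

```python
def stat_name(model_dataset: str, language: str) -> str:
    """Build the statistic key from the model dataset and language."""
    return f"ccnet_perplexity_{model_dataset}_{language}"


def fake_extract_stats(perplexity: float, model_dataset: str, language: str) -> dict[str, float]:
    """Hypothetical stand-in: one key per document, holding the raw score."""
    return {stat_name(model_dataset, language): perplexity}


# The score here is a made-up value, not real model output.
doc_stats = fake_extract_stats(341.2, "wikipedia", "en")
print(doc_stats)
```

Because the key embeds both parameters, runs with different models or languages produce distinct statistic names.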
Usage
Use CCNetPerplexityStats when you need to compute perplexity-based quality scores for documents, following the CCNet methodology. This is particularly valuable for quality-tiered filtering of web-crawled data, where perplexity serves as a proxy for text quality relative to a reference corpus.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/stats/perplexity_stats.py
- Lines: 1-37
Signature
class CCNetPerplexityStats(BaseStats):
    name = "🤯 CCNet perplexity stats"
    _requires_dependencies = BaseStats._requires_dependencies + ["kenlm"]
    def __init__(
        self,
        output_folder: DataFolderLike,
        model_dataset: str,
        language: str = Languages.english,
        histogram_round_digits: int = 3,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:
        ...
    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...
Import
from datatrove.pipeline.stats.perplexity_stats import CCNetPerplexityStats
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes | Folder where computed statistics will be saved |
| model_dataset | str | Yes | Identifier for the pre-trained KenLM model dataset to use |
| language | str | No | Language code for the KenLM model (default: Languages.english) |
| histogram_round_digits | int | No | Decimal digits for histogram rounding (default: 3) |
| groups_to_compute | list[GROUP] | No | Grouping strategies for statistics (default: all groups) |
| top_k_config | TopKConfig | No | Top-K configuration for high-cardinality groups (default: DEFAULT_TOP_K_CONFIG) |
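The effect of histogram_round_digits can be pictured with a small sketch. It assumes that each value is rounded to the given number of decimal digits before being counted into a histogram bucket; the helper `rounded_histogram` and the scores are hypothetical and do not reproduce the library's implementation.

```python
from collections import Counter


def rounded_histogram(values: list[float], round_digits: int = 3) -> Counter:
    """Count values after rounding, so nearby scores share a bucket."""
    return Counter(round(v, round_digits) for v in values)


# Made-up perplexity scores for three documents.
scores = [341.2004, 341.2001, 512.7]

fine = rounded_histogram(scores, round_digits=3)
coarse = rounded_histogram(scores, round_digits=0)
print(len(fine), len(coarse))
```

Fewer digits give coarser buckets and a smaller histogram, which can matter for a continuous metric such as perplexity.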
Outputs
| Name | Type | Description |
|---|---|---|
| ccnet_perplexity_{model_dataset}_{language} | float | KenLM perplexity score for the document; lower values indicate text more similar to the model's training data |
Usage Examples
Basic Usage
from datatrove.pipeline.stats.perplexity_stats import CCNetPerplexityStats
# Compute perplexity using a Wikipedia-trained KenLM model
stats = CCNetPerplexityStats(
    output_folder="output/stats/",
    model_dataset="wikipedia",
    language="en",
)
Multiple Languages
from datatrove.pipeline.stats.perplexity_stats import CCNetPerplexityStats
# Compute perplexity for French text
stats = CCNetPerplexityStats(
    output_folder="output/stats/fr/",
    model_dataset="wikipedia",
    language="fr",
    groups_to_compute=["summary", "histogram"],
)