Implementation:Huggingface Datatrove KenlmModel

Knowledge Sources	Huggingface_Datatrove
Domains	NLP, Perplexity Scoring, Data Quality
Last Updated	2026-02-14 17:00 GMT

Overview

Wraps a KenLM n-gram language model with SentencePiece tokenization to compute perplexity scores for text documents.

Description

The KenlmModel class provides a self-contained interface for computing perplexity scores on text using pre-trained KenLM language models hosted on the Hugging Face Hub. It combines two components: a SentencePiece tokenizer that segments text into subword units, and a KenLM n-gram model that scores the tokenized text. Both components are loaded lazily on first access; their files are downloaded from the edugp/kenlm repository and cached.
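The lazy-loading behaviour described above can be illustrated with a minimal sketch. This is not the datatrove code: LazyAsset, its loader argument, and the load_count counter are hypothetical names used only to show the pattern of deferring an expensive load until first access and reusing the result afterwards.

```python
class LazyAsset:
    """Defers an expensive load until first access, then reuses the result."""

    def __init__(self, loader):
        self._loader = loader  # hypothetical zero-argument callable that builds the asset
        self._asset = None     # nothing is loaded at construction time
        self.load_count = 0    # counts real loads, for demonstration only

    @property
    def asset(self):
        # Load on first access only; later accesses return the cached object.
        if self._asset is None:
            self._asset = self._loader()
            self.load_count += 1
        return self._asset


# The lambda stands in for downloading and opening a model file.
lazy = LazyAsset(lambda: {"name": "stand-in for a loaded model"})
first = lazy.asset
second = lazy.asset
print(lazy.load_count)
```

Constructing the object stays cheap under this pattern, which matters when many pipeline workers are created but only some of them end up scoring text.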

Before scoring, text undergoes a normalization pipeline that includes lowercasing, Unicode diacritic normalization, number normalization, Unicode punctuation replacement, and non-printing character removal. This normalization follows the approach used by CCNet (Facebook's Common Crawl processing pipeline) to ensure consistent perplexity measurements across diverse web text. The normalize method delegates part of its work to simplify_text from the datatrove text utilities module.
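The normalization steps listed above can be sketched with the standard library alone. This is an approximation under stated assumptions, not the datatrove implementation: normalize_sketch and strip_accents are hypothetical names, the punctuation table is a small illustrative subset, and replacing every digit with "0" follows the CCNet convention, which is assumed here rather than confirmed from the source.

```python
import re
import unicodedata

# Illustrative subset of a Unicode-to-ASCII punctuation map (assumption: the real table is larger).
UNICODE_PUNCT = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'", "\u3002": "."}
UNICODE_PUNCT_RE = re.compile("[" + "".join(map(re.escape, UNICODE_PUNCT)) + "]")
DIGIT_RE = re.compile(r"\d")
# C0 control characters, DEL, and C1 control characters. This range also covers tab and newline.
NON_PRINTING_RE = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def strip_accents(text):
    """Decompose characters and drop the combining marks (diacritics)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_sketch(text):
    text = text.lower()                      # lowercasing
    text = strip_accents(text)               # diacritic normalization
    text = DIGIT_RE.sub("0", text)           # number normalization (assumed: every digit becomes "0")
    text = UNICODE_PUNCT_RE.sub(lambda m: UNICODE_PUNCT[m.group(0)], text)  # punctuation replacement
    text = NON_PRINTING_RE.sub("", text)     # non-printing character removal
    return text


result = normalize_sketch("Caf\u00e9 \u201c2024\u201d\x07")
print(result)
```

Collapsing case, accents, and digits reduces the number of distinct surface forms the n-gram model sees, so two documents that differ only in such details receive comparable scores.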

The companion SentencePiece class handles subword tokenization by loading a SentencePiece model from the same Hugging Face repository. It encodes text into subword pieces and joins them with spaces, producing the format expected by KenLM for scoring. Perplexity is computed line by line: the log-scores (base-10 log probabilities, as returned by KenLM) and token counts are accumulated across lines, and the totals are converted to a perplexity value via the formula 10^(-log_score / length).
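The line-by-line accumulation can be sketched as follows. This is a sketch under assumptions rather than the library's code: perplexity_from_lines and toy_score are hypothetical names, toy_score is a stand-in for a real KenLM scorer, and counting one extra token per line for the end-of-sentence marker is an assumption about how length is measured.

```python
def perplexity_from_lines(tokenized_doc, score_line):
    """Accumulate log10 scores and token counts per line, then convert to perplexity.

    score_line is any callable returning the total log10 probability of a line.
    """
    total_log_score, total_length = 0.0, 0
    for line in tokenized_doc.split("\n"):
        total_log_score += score_line(line)
        # Assumption: count one extra token per line for the end-of-sentence marker.
        total_length += len(line.split()) + 1
    # Perplexity = 10^(-log_score / length), rounded to one decimal place.
    return round(10.0 ** (-total_log_score / total_length), 1)


def toy_score(line):
    # Hypothetical scorer: every token (plus end-of-sentence) has log10 probability -1.
    return -1.0 * (len(line.split()) + 1)


ppl = perplexity_from_lines("\u2581the \u2581quick \u2581fox\n\u2581jumps", toy_score)
print(ppl)
```

With the toy scorer every token has probability 0.1, so the result is 10.0 regardless of document length. Dividing by the token count is what makes perplexity comparable between short and long documents.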

Usage

Use KenlmModel when you need to compute perplexity scores for text quality filtering. It is commonly used in data processing pipelines to identify and remove low-quality or incoherent text from large web crawls. The model is language-specific, so you must specify which language model to load.
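A threshold-based filter built on such scores might look like the sketch below. The names filter_by_perplexity and stub_scores are hypothetical, the scores are made-up stand-ins for values that model.get_perplexity would return, and the cutoff of 500.0 is arbitrary; a suitable threshold depends on the language, the model, and the corpus.

```python
def filter_by_perplexity(docs, get_perplexity, max_perplexity):
    """Keep documents whose perplexity does not exceed max_perplexity."""
    return [doc for doc in docs if get_perplexity(doc) <= max_perplexity]


# Hypothetical scores standing in for the output of a real perplexity model.
stub_scores = {"A clear sentence.": 180.0, "asdf qwer zxcv": 4200.0}

kept = filter_by_perplexity(list(stub_scores), stub_scores.get, max_perplexity=500.0)
print(kept)
```

Passing the scoring function as an argument keeps the filter independent of any particular model, so in practice the bound method of a loaded KenlmModel could take the place of the stub lookup.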

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/utils/perplexity.py
Lines: 1-164

Signature

class SentencePiece:
    def __init__(self, model_dataset: str, model_name: str): ...
    def tokenize(self, text: str) -> str: ...

class KenlmModel:
    digit_re: re.Pattern
    unicode_punct: Dict[str, str]
    unicode_punct_re: re.Pattern
    non_printing_chars_re: re.Pattern

    def __init__(self, model_dataset: str, language: str): ...

    @classmethod
    def from_pretrained(cls, model_dataset: str, language: str) -> "KenlmModel": ...

    def get_perplexity(self, doc: str, normalize_cc_net: bool = True) -> float: ...
    def normalize(self, text: str) -> str: ...
    def replace_unicode_punct(self, text: str) -> str: ...
    def remove_non_printing_char(self, text: str) -> str: ...
    def pp(self, log_score: float, length: int) -> float: ...

Import

from datatrove.utils.perplexity import KenlmModel, SentencePiece

I/O Contract

Inputs

Name	Type	Required	Description
model_dataset	str	Yes	Subdirectory path within the edugp/kenlm repository (e.g., "wikipedia" or "oscar")
language	str	Yes	Language code identifying which model and tokenizer files to load (e.g., "en", "fr")
doc	str	Yes	Raw text string to compute perplexity for (passed to get_perplexity)
normalize_cc_net	bool	No	Whether to apply CCNet-style text normalization before scoring; defaults to True

Outputs

Name	Type	Description
perplexity	float	The computed perplexity score, rounded to one decimal place; lower values indicate text that the model finds more predictable, which is typically read as more fluent

Usage Examples

Basic Usage

from datatrove.utils.perplexity import KenlmModel

# Load a KenLM model for English from the wikipedia dataset
model = KenlmModel.from_pretrained(model_dataset="wikipedia", language="en")

# Compute perplexity on a text string
text = "The quick brown fox jumps over the lazy dog."
perplexity = model.get_perplexity(text)
print(f"Perplexity: {perplexity}")  # e.g., Perplexity: 230.5

Without Normalization

from datatrove.utils.perplexity import KenlmModel

model = KenlmModel(model_dataset="wikipedia", language="fr")

# Skip CCNet normalization if text is already preprocessed
perplexity = model.get_perplexity("Bonjour le monde.", normalize_cc_net=False)
print(f"Perplexity: {perplexity}")

Related Pages

Principle:Huggingface_Datatrove_KenLM_Perplexity_Scoring
