
Implementation:Huggingface Datatrove KenlmModel

From Leeroopedia
Knowledge Sources
Domains: NLP, Perplexity Scoring, Data Quality
Last Updated: 2026-02-14 17:00 GMT

Overview

Wraps a KenLM n-gram language model with SentencePiece tokenization to compute perplexity scores for text documents.

Description

The KenlmModel class provides a self-contained interface for computing perplexity scores on text using pre-trained KenLM language models hosted on the Hugging Face Hub. It combines two components: a SentencePiece tokenizer that segments text into subword units, and a KenLM n-gram model that scores the tokenized text. Both components are fetched from the edugp/kenlm repository, loaded lazily on first access, and cached for subsequent calls.
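
The lazy-loading behaviour described above can be illustrated with a minimal sketch. This is not the datatrove implementation: the class name LazyScorer, the loader callable, and the load_count attribute are hypothetical and exist only to show that an expensive resource is built on first access and reused afterwards.

```python
from functools import cached_property
from typing import Any, Callable


class LazyScorer:
    """Hypothetical stand-in that defers an expensive load until first use."""

    def __init__(self, model_dataset: str, language: str,
                 loader: Callable[[str, str], Any]):
        self.model_dataset = model_dataset
        self.language = language
        self._loader = loader  # hypothetical: (model_dataset, language) -> loaded object
        self.load_count = 0    # bookkeeping for the illustration only

    @cached_property
    def model(self) -> Any:
        # Runs once, on first attribute access; the result is then stored
        # on the instance and returned directly by later accesses.
        self.load_count += 1
        return self._loader(self.model_dataset, self.language)
```

Constructing the object is cheap; the cost of loading is paid only by the first caller that actually needs the model.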

Before scoring, text undergoes a normalization pipeline that includes lowercasing, Unicode diacritic normalization, number normalization, Unicode punctuation replacement, and non-printing character removal. This normalization follows the approach used by CCNet (Facebook's Common Crawl processing pipeline) to ensure consistent perplexity measurements across diverse web text. The normalize method delegates part of its work to simplify_text from the datatrove text utilities module.
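
A rough sketch of what a CCNet-style normalization of a single line can look like is given below. It is an approximation for illustration, not the code used by KenlmModel.normalize: the function names are hypothetical, the punctuation mapping is a small subset chosen for the example, and replacing every digit with "0" is an assumption based on the CCNet convention.

```python
import re
import unicodedata

# Abbreviated, illustrative mapping; a real table would be much larger.
UNICODE_PUNCT = {"，": ",", "。": ".", "“": '"', "”": '"', "‘": "'", "’": "'", "…": "..."}
UNICODE_PUNCT_RE = re.compile("[" + "".join(map(re.escape, UNICODE_PUNCT)) + "]")
DIGIT_RE = re.compile(r"\d")
# Control characters: code points 0-31 and 127-159.
NON_PRINTING_RE = re.compile(
    "[" + "".join(map(chr, list(range(0, 32)) + list(range(127, 160)))) + "]"
)


def strip_accents(text: str) -> str:
    # Decompose characters, then drop the combining marks (category "Mn").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_line(line: str) -> str:
    """Hypothetical helper: normalize one line of text before scoring."""
    line = line.strip().lower()
    line = strip_accents(line)
    line = DIGIT_RE.sub("0", line)  # assumption: every digit becomes "0"
    line = UNICODE_PUNCT_RE.sub(lambda m: UNICODE_PUNCT[m.group(0)], line)
    return NON_PRINTING_RE.sub("", line)
```

Because the control-character pattern also matches newlines, a helper like this is meant to be applied to one line at a time rather than to a whole multi-line document.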

The companion SentencePiece class handles subword tokenization by loading a SentencePiece model from the same Hugging Face repository. It encodes text into subword pieces and joins them with spaces, producing the format expected by KenLM for scoring. Perplexity is computed line by line: the log-scores and token counts of all lines are accumulated, and the totals are converted to a perplexity value via the formula 10^(-log_score / length), where log_score is the summed base-10 log probability and length is the total token count.
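
The accumulation and the final conversion can be sketched as follows. This is a simplified illustration under stated assumptions, not the library code: perplexity_from_lines and score_line are hypothetical names, score_line stands in for a KenLM scorer returning a base-10 log probability, and adding one to each line's length assumes an end-of-sentence token is counted per line.

```python
from typing import Callable


def perplexity_from_lines(tokenized_doc: str,
                          score_line: Callable[[str], float]) -> float:
    """Hypothetical helper: combine per-line log10 scores into one perplexity."""
    total_log_score, total_length = 0.0, 0
    for line in tokenized_doc.split("\n"):
        total_log_score += score_line(line)    # log10 probability of the line
        total_length += len(line.split()) + 1  # assumption: +1 for end-of-sentence
    # 10^(-log_score / length), rounded to one decimal place
    return round(10.0 ** (-total_log_score / total_length), 1)
```

Summing over lines before exponentiating makes the result a token-weighted average, so a long line influences the document score more than a short one.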

Usage

Use KenlmModel when you need to compute perplexity scores for text quality filtering. It is commonly used in data processing pipelines to identify and remove low-quality or incoherent text from large web crawls. The model is language-specific, so you must specify which language model to load.
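
A perplexity-based filter built on top of such a scorer might look like the sketch below. The helper filter_by_perplexity and the max_perplexity threshold are hypothetical and not part of datatrove; a suitable threshold depends on the language, the model, and the corpus, and has to be chosen empirically.

```python
from typing import Callable, Iterable, Iterator


def filter_by_perplexity(docs: Iterable[str],
                         get_perplexity: Callable[[str], float],
                         max_perplexity: float) -> Iterator[str]:
    """Hypothetical helper: yield only documents scoring at or below a threshold."""
    for doc in docs:
        if get_perplexity(doc) <= max_perplexity:
            yield doc


# Possible use with a loaded model (threshold value is illustrative only):
# kept = list(filter_by_perplexity(texts, model.get_perplexity, max_perplexity=1000.0))
```

Passing the scoring function as an argument keeps the filter independent of any particular model and makes it easy to test with a stub.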

Code Reference

Source Location

Signature

class SentencePiece:
    def __init__(self, model_dataset: str, model_name: str): ...
    def tokenize(self, text: str) -> str: ...

class KenlmModel:
    digit_re: re.Pattern
    unicode_punct: Dict[str, str]
    unicode_punct_re: re.Pattern
    non_printing_chars_re: re.Pattern

    def __init__(self, model_dataset: str, language: str): ...

    @classmethod
    def from_pretrained(cls, model_dataset: str, language: str) -> "KenlmModel": ...

    def get_perplexity(self, doc: str, normalize_cc_net: bool = True) -> float: ...
    def normalize(self, text: str) -> str: ...
    def replace_unicode_punct(self, text: str) -> str: ...
    def remove_non_printing_char(self, text: str) -> str: ...
    def pp(self, log_score: float, length: int) -> float: ...

Import

from datatrove.utils.perplexity import KenlmModel, SentencePiece

I/O Contract

Inputs

Name | Type | Required | Description
model_dataset | str | Yes | Subdirectory path within the edugp/kenlm repository (e.g., "wikipedia" or "oscar")
language | str | Yes | Language code identifying which model and tokenizer files to load (e.g., "en", "fr")
doc | str | Yes | Raw text string to compute perplexity for (passed to get_perplexity)
normalize_cc_net | bool | No | Whether to apply CCNet-style text normalization before scoring; defaults to True

Outputs

Name | Type | Description
perplexity | float | The computed perplexity score, rounded to one decimal place; lower values indicate more fluent text

Usage Examples

Basic Usage

from datatrove.utils.perplexity import KenlmModel

# Load a KenLM model for English from the wikipedia dataset
model = KenlmModel.from_pretrained(model_dataset="wikipedia", language="en")

# Compute perplexity on a text string
text = "The quick brown fox jumps over the lazy dog."
perplexity = model.get_perplexity(text)
print(f"Perplexity: {perplexity}")  # e.g., Perplexity: 230.5

Without Normalization

from datatrove.utils.perplexity import KenlmModel

model = KenlmModel(model_dataset="wikipedia", language="fr")

# Skip CCNet normalization if text is already preprocessed
perplexity = model.get_perplexity("Bonjour le monde.", normalize_cc_net=False)
print(f"Perplexity: {perplexity}")
