Implementation:Huggingface Datatrove KenlmModel
| Knowledge Sources | |
|---|---|
| Domains | NLP, Perplexity Scoring, Data Quality |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Wraps a KenLM n-gram language model with SentencePiece tokenization to compute perplexity scores for text documents.
Description
The KenlmModel class provides a self-contained interface for computing perplexity scores on text using pre-trained KenLM language models hosted on the Hugging Face Hub. It combines two components: a SentencePiece tokenizer that segments text into subword units, and a KenLM n-gram model that scores the tokenized text. Both components are loaded lazily on first access; their files are downloaded from the edugp/kenlm repository and cached for reuse.
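The lazy-loading pattern described above can be sketched with a property that builds the expensive object only when it is first read. This is a minimal illustration, not the datatrove source: `LazyResource`, `loader`, and `load_count` are hypothetical names, and the loader here is a stand-in for downloading and opening a model file.

```python
from typing import Any, Callable, Optional


class LazyResource:
    """Hypothetical helper that defers an expensive load until first access."""

    def __init__(self, loader: Callable[[], Any]):
        self._loader = loader                 # invoked at most once
        self._value: Optional[Any] = None     # filled in on first access
        self.load_count = 0                   # bookkeeping for illustration only

    @property
    def value(self) -> Any:
        if self._value is None:
            self._value = self._loader()
            self.load_count += 1
        return self._value


# Stand-in for fetching and opening a model; nothing runs until .value is read.
resource = LazyResource(lambda: {"name": "stub-model"})
loads_before = resource.load_count
first = resource.value
second = resource.value
loads_after = resource.load_count
```

Constructing the object is cheap, so a pipeline can create it up front and pay the loading cost only in the worker that actually scores text.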
Before scoring, text undergoes a normalization pipeline that includes lowercasing, Unicode diacritic normalization, number normalization, Unicode punctuation replacement, and non-printing character removal. This normalization follows the approach used by CCNet (Facebook's Common Crawl processing pipeline) to ensure consistent perplexity measurements across diverse web text. The normalize method delegates part of its work to simplify_text from the datatrove text utilities module.
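As a rough sketch of the normalization steps listed above, the following standalone function applies them in one plausible order. It is an approximation rather than the library's implementation: the punctuation table is a short assumed subset, replacing every digit with "0" is assumed from the CCNet convention, and `normalize_sketch` and `strip_accents` are hypothetical names.

```python
import re
import unicodedata

# Assumed, abbreviated mapping; the table on the real class is longer.
UNICODE_PUNCT = {"，": ",", "。": ".", "“": '"', "”": '"', "‘": "'", "’": "'", "…": "..."}
UNICODE_PUNCT_RE = re.compile("|".join(map(re.escape, UNICODE_PUNCT)))
DIGIT_RE = re.compile(r"\d")
# Control characters in code points 0-31 and 127-159 (this includes tab and newline).
NON_PRINTING_RE = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def strip_accents(text: str) -> str:
    """Decompose characters and drop combining marks (diacritics)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_sketch(text: str) -> str:
    text = text.strip().lower()                                        # lowercasing
    text = strip_accents(text)                                         # diacritic normalization
    text = DIGIT_RE.sub("0", text)                                     # number normalization (assumed: digits -> "0")
    text = UNICODE_PUNCT_RE.sub(lambda m: UNICODE_PUNCT[m.group(0)], text)  # punctuation replacement
    text = NON_PRINTING_RE.sub("", text)                               # non-printing character removal
    return text


normalized = normalize_sketch("Café “Zürich” 2024\x07")
```

Collapsing digits and punctuation variants this way means two documents that differ only in surface formatting receive similar scores.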
The companion SentencePiece class handles subword tokenization by loading a SentencePiece model from the same Hugging Face repository. It encodes text into subword pieces and joins them with spaces, producing the format expected by KenLM for scoring. The overall perplexity is computed line by line: the base-10 log-scores returned by KenLM and the token counts are accumulated across all lines, then converted to a perplexity value via the formula 10^(-log_score / length).
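The accumulation and the final conversion can be illustrated without loading a real model. In the sketch below, `perplexity_sketch` and `stub_score` are hypothetical names; the scorer is a stand-in for the KenLM model, and counting one extra token per line for the end-of-sentence marker is an assumption about how length is measured.

```python
from typing import Callable


def perplexity_sketch(tokenized_doc: str, score_line: Callable[[str], float]) -> float:
    """Accumulate per-line log10 scores and token counts, then convert."""
    total_log_score = 0.0
    total_length = 0
    for line in tokenized_doc.split("\n"):
        total_log_score += score_line(line)
        # Assumption: one extra count per line for the end-of-sentence token.
        total_length += len(line.split()) + 1
    return round(10.0 ** (-total_log_score / total_length), 1)


def stub_score(line: str) -> float:
    """Stand-in scorer: every token, including end-of-sentence, has log10 probability -2."""
    return -2.0 * (len(line.split()) + 1)


# Space-joined subword pieces, as a SentencePiece-style tokenizer would produce.
pieces = "▁hello ▁world\n▁good bye"
result = perplexity_sketch(pieces, stub_score)
print(result)  # 100.0, since each token has probability 10^-2
```

Because the log-scores and lengths are summed before the exponent is taken, the result is a token-weighted perplexity for the whole document rather than an average of per-line perplexities.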
Usage
Use KenlmModel when you need to compute perplexity scores for text quality filtering. It is commonly used in data processing pipelines to identify and remove low-quality or incoherent text from large web crawls. The model is language-specific, so you must specify which language model to load.
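A quality filter built on such a score might look like the sketch below. The function name `filter_by_perplexity`, the threshold value, and the stub scores are all hypothetical; in a real pipeline the `score` argument would be something like the model's `get_perplexity` method, and the threshold would need to be tuned per language and dataset.

```python
from typing import Callable, Iterable, List


def filter_by_perplexity(
    docs: Iterable[str],
    score: Callable[[str], float],
    max_perplexity: float,
) -> List[str]:
    """Keep only documents whose perplexity does not exceed the threshold."""
    return [doc for doc in docs if score(doc) <= max_perplexity]


# Stub scores standing in for real model output (values are made up for illustration).
stub_scores = {
    "A well-formed sentence about language models.": 180.0,
    "asdf qwer zxcv lorem buy now click": 2400.0,
}

kept = filter_by_perplexity(stub_scores, stub_scores.__getitem__, max_perplexity=1000.0)
```

Passing the scorer as a callable keeps the filter independent of how the model is loaded, which also makes it easy to test with stubbed scores.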
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/utils/perplexity.py
- Lines: 1-164
Signature
class SentencePiece:
    def __init__(self, model_dataset: str, model_name: str): ...
    def tokenize(self, text: str) -> str: ...

class KenlmModel:
    digit_re: re.Pattern
    unicode_punct: Dict[str, str]
    unicode_punct_re: re.Pattern
    non_printing_chars_re: re.Pattern

    def __init__(self, model_dataset: str, language: str): ...
    @classmethod
    def from_pretrained(cls, model_dataset: str, language: str) -> "KenlmModel": ...
    def get_perplexity(self, doc: str, normalize_cc_net: bool = True) -> float: ...
    def normalize(self, text: str) -> str: ...
    def replace_unicode_punct(self, text: str) -> str: ...
    def remove_non_printing_char(self, text: str) -> str: ...
    def pp(self, log_score: float, length: int) -> float: ...
Import
from datatrove.utils.perplexity import KenlmModel, SentencePiece
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_dataset | str | Yes | Subdirectory path within the edugp/kenlm repository, naming the corpus the model was trained on (e.g., "wikipedia" or "oscar") |
| language | str | Yes | Language code identifying which model and tokenizer files to load (e.g., "en", "fr") |
| doc | str | Yes | Raw text string to compute perplexity for (passed to get_perplexity) |
| normalize_cc_net | bool | No | Whether to apply CCNet-style text normalization before scoring; defaults to True |
Outputs
| Name | Type | Description |
|---|---|---|
| perplexity | float | The computed perplexity score, rounded to one decimal place; lower values indicate more fluent text |
Usage Examples
Basic Usage
from datatrove.utils.perplexity import KenlmModel
# Load a KenLM model for English from the wikipedia dataset
model = KenlmModel.from_pretrained(model_dataset="wikipedia", language="en")
# Compute perplexity on a text string
text = "The quick brown fox jumps over the lazy dog."
perplexity = model.get_perplexity(text)
print(f"Perplexity: {perplexity}") # e.g., Perplexity: 230.5
Without Normalization
from datatrove.utils.perplexity import KenlmModel
model = KenlmModel(model_dataset="wikipedia", language="fr")
# Skip CCNet normalization if text is already preprocessed
perplexity = model.get_perplexity("Bonjour le monde.", normalize_cc_net=False)
print(f"Perplexity: {perplexity}")